A few notes on assembly

19 Nov 2011

Quick navigation within this page

Introduction

Assembly language is a very low level programming language. Many higher programming language compilers translate their code into assembly and then into machine code. Although almost no-one develops software exclusively in assembly it is still very useful to be able to understand the language. For example with the gnu c compiler gcc one can inspect the assembly code using the -S flag and the result will be stored in a file with the extention .s.
$ gcc -Wall -O2 -S -masm=intel prog.c	# generates assembly prog.s
$ gcc prog.s -o prog			# produces executable	
For the windows compiler the option is /FA, i.e. cl /O2 /FA prog.cpp writes the assembly code into prog.asm and generates the executable. By default gcc generates assembly code using the AT&T syntax whereas cl generates Intel syntax. Since the Intel syntax is described more widely and also somewhat easier to work with we use the flag -masm=intel to get gcc producing Intel syntax as well.

The main concepts in assembly are registers and instructions, where registers are very small (e.g. 64 bit on a 64-bit cpu, sse registers are 128 bit) but very fast memory locations integrated into the cpu. Each register normally only stores one variable. Let's have a look at a simple (and rather pointless) loop in c++:

$ cat loop.cpp
double loop(double a, int n){
   double sum=0.0;
   for(int i=1; i<n; i++) {
      sum+=a;
   }
   return sum;
}
This loop translates into the assembly code below. We use sse2 floating point instructions (default on 64-bit systems), where on older systems (32-bit) x87 instructions would be generated by default (even if sse2 was available).
$ g++ -Wall -O2 -msse2 -mfpmath=sse -S -masm=intel loop.cpp
$ cat loop.s
...
.L4:
	add	eax, 1		# i+=1;
	addsd	xmm0, xmm1	# sum+=a;
	cmp	eax, edi	# compare i, n
	jne	.L4		# conditional jump to .L4
...
A non-optimised version would not necessary load all variables into registers and so require lots of memory access which is slow:
$ g++ -Wall -msse2 -mfpmath=sse -S -masm=intel loop.cpp
$ cat loop.s
...
.L3:
	movsd	xmm0, QWORD PTR [rbp-16]
	addsd	xmm0, QWORD PTR [rbp-24]
	movsd	QWORD PTR [rbp-16], xmm0
	add	DWORD PTR [rbp-4], 1
.L2:
	mov	eax, DWORD PTR [rbp-4]
	cmp	eax, DWORD PTR [rbp-28]
	setl	al
	test	al, al
	jne	.L3
...
For larger programmes it will be harder to find the assembler code corresponding to particular parts of the C++ source, so it can be a good idea to add markers to the assembly, for example like this:
$ cat test.cpp
...
#define ASM_COMMENT(X)  asm volatile ("#" X)
int main() {
...
   ASM_COMMENT("start loop");
   for(int i=0; i<n; i++) {
...

References:

8080 8-bit instruction set

registers
general purpose 8-bit A, B, C, D, E, H, L
16-bit BC, DE, HL: limited 16-bit operations
program registers 16-bit SP: stack pointer, points to stack memory
PC: program counter, memory address of next instruction

8086 16-bit instruction set (x86-16)

The term x86 refers to the instruction set introduced by the Intel 8086 and is used up to the Intel 80286 cpu. For backwards compatibility reasons even most modern Intel and AMD processors contain that 16-bit instruction set as a subset.

General purpose registers as well as index registers (pointing to memory locations) are 16-bit wide. As a consequence only 216 = 64KB of memory can be directly addressed. However, the 8086 and 80286 have a 20-bit and 24-bit external address bus and can address 220 = 1MB and 224 = 16MB of main memory, respectively. This is achieved using a cumbersome mechanism of memory segmentation and is also the reason why segment registers are needed.

registers
general purpose 16-bit AX: accumulator
BX: base index, e.g. for arrays
CX: counter
DX: data, general
these are just a recommendations, registers can be used for anything
8-bit AX=(AH, AL), BX=(BH, BL), ...
index registers 16-bit SP: stack pointer, contains main memory address of the top of the stack
BP: base pointer, used to point to some other place in stack, typically above the local variables
SI, DI: source, destination index, used to point to arrays (e.g. strings)
segment register 16-bit CS: code segment
DS: data segment
ES: extra segment
SS: stack segment
status register 16-bit one register which is implicitly set and read by instructions
instruction pointer 16-bit IP: points to the next instruction in main memory
instruction pseudo code description
mov ax, bx
mov ax, 12
mov ax, [sp]
ax = bx;
ax = 12;
ax = *sp;
copies contents
lea ax, [sp]
lea ax, [sp+si]
ax = &(*sp) = sp;
ax = &(*(sp+si))=sp+si;
load effective address, returns the address of memory location,
note, lea ax, [dx+8] is short for mov ax, dx and add ax, 8
add ax, bx ax += bx; integer addition
sub ax, bx ax -= bx; integer subtraction
inc ax ... dec ax ax++; ax--; short for add x, 1 and sub x, 1
imul ax, bx ax *= bx; signed int multiply
idiv ax, bx ax /= bx; signed int divide
mul unsigned int multiply
div bx ax=[dx,ax]/bx, dx=[dx,ax]%bx unsigned int divide, ax contains the quotient and dx the raminder
and ax, bx ax = ax & bx; bitwise logical and
or ax, bx ax = ax | bx; bitwise logical or
xor ax, bx ax = ax ^ bx; bitwise logical xor, e.g. xor ax, ax means ax = 0
not ax ax = ~ax; bitwise logical not
sar ax, 3 ax = ax >> 3; signed int bitwise shift right
sal ax, 3 ax = ax << 3; signed int bitwise shift left
shl, shr unsigned int version of shift operators
cmp ax, bx tests ax and bx of equality, internally using sub without
overwriting ax but only setting the status/flag register
jmp .label jump to location .label in assembly code
je .label
jg .label
jl .label
jge .label
jle .label
jne .label
conditional jump to .label based on the values in the status/flag register;
normally follows the cmp instruction, and the jump is executed
if the comparison resulted in equal (e), greater (g), less (l),
not equal (ne), or combinations of it, respectively
pop ax ax = stack.rmtop(); moves the top value of the memory stack into ax
push ax stack.add() = ax; saves the value of ax on top of the memory stack
test
BYTE : 1 byte (8 bit)
WORD : 2 byte (16 bit)
DWORD : 4 byte (32 bit)
QWORD : 8 byte (64 bit)
TBYTE : 10 byte (80 bit)
these are type specifiers; since for example [si] only points
to the beginning of a memory address and in the absence of type
specifiers we don't know the bit-length of [si];
if registers are involved we implicitly know the bit-length:
mov ax, [si] : use 16-bit
mov al, [si] : use 8-bit
if registers are not involved we need to specify the data type:
mov WORD [si], 0 : use 16-bit
References:

80386 32-bit instruction set (x86-32)

Also known as IA-32 and i386, the x86-32 extends all registers to 32-bit and adds a few new instructions but keeps all the old x86-16 instructions.
registers
general purpose 32-bit EAX, EBX, ECX, EDX
index registers 32-bit ESP: stack pointer, contains main memory address of the top of the stack
EBP: base pointer, used to point to some other place in stack, typically above the local variables
ESI, EDI: source, destination index, used to point to arrays
status register 32-bit one register which is implicitly set and read by instructions
instruction pointer 32-bit EIP: points to the next instruction in main memory
References:

AMD64 64-bit instruction set (x86-64)

The x86-64 has been developed by AMD and is also known as AMD64 and Intel 64. Note, the Intel Itanium (IA-64) has a different 64-bit instruction set.
registers
general purpose 64-bit RAX, RBX, RCX, RDX
R8, R9, ..., R15
index registers 64-bit RSP: stack pointer, contains main memory address of the top of the stack
RBP: base pointer, used to point to some other place in stack, typically above the local variables
RSI, RDI: source, destination index, used to point to arrays
status register 64-bit RFLAGS one register which is implicitly set and read by instructions
instruction pointer 64-bit RIP: points to the next instruction in main memory
References:

x87 fpu instruction set

registers
general purpose 80-bit ST(0), ST(1), ..., ST(7): arranged in a direct access stack structure, ST(0) is the top of the register stack
control/status 16-bit control register: for rounding/precision control and exception handling, access via fldcw and fstcw
status register: status of last fpu operation, incl. top of stack pointer, fp exceptions, read via fstsw
tag register: status of each register (valid, zero, special, empty)
program 48-bit instruction pointer:
data pointer:
instruction pseudo code description
fld QWORD [sp]
fld st(2)
fst QWORD [dp]
fstp QWORD [dp]
st(0) = *(sp);
st(0) = st(2);
*(dp) = st(0);
*(dp) = st(0);
loads floats (here QWORD=64-bit) values from memory to the fpu register stack (add to the stack, no overwrite)
store registers in mem, and store with pop, i.e. removed from fpu stack after store
Note:
  • conversion with rounding to and from internal 80-bit format are performed
  • no direct x86 to x87 register transfer
fild DWORD [sp]
fist DWORD [dp]
fistp DWORD [dp]
st(0) = *(sp);
*(dp) = st(0);
*(dp) = st(0);
(signed) integer version of load/store operations, here DWORD=32bit
to load an unsigned int, a conversion to signed int (with higher precision, e.g. 64bit) is necessary
fldz
fld1
fldpi
...
st(0) = 0.0;
st(0) = 1.0;
st(0) = M_PI;
...
loads special constants into register stack, all 80-bit accurate
fxch st(2) swap(st(0),st(2)); swaps values in registers
fadd st(2)
fadd QWORD [sp]
fadd st(4), st
st(0) += st(2);
st(0) += *(sp);
st(4) += st(0);
floating point add
fiadd DWORD [sp] st(0) += *(sp); add a signed integer from memory to stack register (80-bit float)
fsub st(1)
fsubr st(1)
fmul st(1)
fdiv st(1)
fdivr st(1)
st(0) -= st(1);
st(0) = st(1)-st(0);
st(0) *= st(1);
st(0) /= st(1);
st(0) = st(1)/st(0);
other basic operations, there are many more versions, including integer versions, pop versions, and two-operant versions
fsqrt st(0)=sqrt(st(0)) square root
f2xm1
fyl2xp1
fyl2x
st(0) = 2^st(0) - 1;
st(0) = log_2(st(0)+1)*st(1);
st(0) = log_2(st(0))*st(1);
only works for small values -1 < st(0) < 1, used for exp()
an additional pop operation is performed
an additional pop operation is performed, used for log()
fsin
fcos
fptan
fpatan
st(0) = sin(st(0));
st(0) = cos(st(0));
st(0) = tan(st(0));
st(0) = atan(st(1)/st(0));
only available since the 387
fcom st(1) compares st(0) with st(1) and stores the result in the fpu status register
fxam examines st(0) (unsupported, nan, normal, inf, zero, empty, denormal) and stores result in the fpu flag register
fldcw [sp]
fstcw [dp]
loads and stores the control register
fstsw [dp] stores the status register in main memory at location dp
Usage by gcc:
The gcc compiler emits x87-fpu instruction when compiling x86-32 bit code and sse-fpu instructions when compiling x86-64 bit code. This behaviour can be changed using different compiler flags:
$ gcc -m32			# x86-32 bit mode (x87-fpu is default)
$ gcc -mfpmath=387		# use x87-fpu instructions (sse default on x86-64)
$ gcc -ffast-math		# emits more x87-fpu instructions, because
				# -funsafe-math-optimizations is included
$ gcc -mpc32 / -mpc64 / -mpc80	# set internal x87-fpu precision to 32/64/80 bit
$ gcc -mfpmath=387 -msoft-float	# uses software emulation of fpu instructions
C function gcc -S -m32 gcc -S -m32 -funsafe-math-optimizations
sqrt() fsqrt fsqrt
exp() libmath.so using f2xm1, fldl2e, frndint, fscale, ...
log() libmath.so using fyl2x, fldln2
pow() libmath.so libmath.so
sin(), cos() libmath.so fsin, fcos
tan() libmath.so fptan, fld1
atan() libmath.so fpatan
acos() libmath.so using fpatan, fsqrt, fmul, fsubr
cosh() libmath.so libmath.so
acosh() libmath.so libmath.so
erf() libmath.so libmath.so
Notes: References:

MMX fpu instruction set

The MMX instruction set was introduced by Intel in 1996 on top of the Pentium processor to facilitate SIMD (single instruction, multiple data) instructions. It introduces new MM? 64-bit wide registers which can be packed with multiples of lower-bit integer data types and instruction are then executed on all of them at once.
registers
general purpose 64-bit MM0, MM1, ..., MM7: using physical registers of the x87 fpu, e.g. SP(?)=[1,...,1,MM0]
One register could either contain a 64-bit integer, two 32-bit integer, or four 16-bit integer and operations execute at the same speed independently of the contents, and therefore a speedup for 32-bit and lower integer operations can be expected if MMX registers are used. There are big performance penalties for switching between x87 fpu and mmx operations.

SSE fpu instruction set

The SSE instruction set is a further development of MMX to improve SIMD capabilities. It introduces dedicated 128-bit wide registers and extends operations to floating point numbers (only float's with the initial SSE version). The most important version is SSE2 as it implements working with double's and int's, so it makes MMX redundant and can also replace x87 operations.
registers
general purpose 128-bit XMM0, XMM1, ..., XMM7: dedicated registers
XMM8, XMM9, ..., XMM15: only on x86-64 bit systems
status register 32-bit MXCSR: control/status register
SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 The SSE unit can be used for vector operations (packed, i.e. when the register is filled with multiple of double's, for example) as well as scalar operations (only part of the register is filled with one double for example), and so can be used as a general purpose floating point unit. Most names have a suffix of the form [p/s][d/s], standing for packed/scalar and double/single precision, e.g.:
addpd add packed double
addps add packed single
addsd add scalar double
addss add scalar single
The table below only shows packed double precision versions and is also very incomplete.
instruction pseudo code C++ intrinsics description
movapd xmm0, xmm1
movupd xmm0, xmm1
xmm0=xmm1
xmm0=xmm1
__m128d _mm_load_pd(double* p) aligned packed double move
unaligned packed double move
addpd xmm1, xmm2
subpd xmm1, xmm2
mulpd xmm1, xmm2
divpd xmm1, xmm2
sqrtpd xmm1
xmm1 += xmm2
xmm1 -= xmm2
xmm1 *= xmm2
xmm1 /= xmm2
xmm1 = sqrt(xmm1)
packed double arithmetic operations
cmppd xmm1, xmm2 packed double compare, result stored where?
unpcklpd xmm1, xmm2 pack lower parts of xmm2 into upper part of xmm1
ldmxcsr DWORD [rsi]
stmxcsr DWORD [rdi]
mxcsr=*rsi
*rdi=mxcsr
unsigned int _mm_getcsr();
void _mm_setcsr(unsigned int i);
access the status register by moving it to/from a general 32-bit memory location
XMMWORD: 16 byte (128 bit) type specifier for memory locations
References:

AVX fpu instruction set

The advanced vector extensions (AVX) instruction set is a successor of SSE and is also backwards compatible. It increases the bit-width of the SSE XMM0-XMM15 registers from 128-bit to 256-bit, now called YMM0-YMM15. The XMM registers are still available but share the physical space of the lower 128-bits of the YMM registers, i.e. YMM0=[... XMM0].
registers
general purpose 256-bit YMM0, YMM1, ..., YMM15
instruction pseudo code C++ intrinsics description
vmovapd ymm0, ymm1
vmovapd ymm0, YMMWORD PTR [rsp]
vmovapd xmm0, xmm1
ymm0=ymm1
ymm0=*rsp
xmm0=xmm1
__m256d _mm256_set_pd()
__m256d _mm256_load_pd()
aligned packed double move
vaddpd ymm1, ymm2, ymm3
vsubpd ymm1, xmm2, ymm3
vmulpd ymm1, ymm2, ymm3
vdivpd ymm1, ymm2, ymm3
vsqrtpd ymm1, ymm2
ymm1=ymm2+ymm3
ymm1=ymm2-ymm3
ymm1=ymm2*ymm3
ymm1=ymm2/ymm3
ymm1=sqrt(ymm2)
packed double arithmetic operations
YMMWORD: 32 byte (256 bit) type specifier for memory locations
References:

CPU architecture

external bus cache
cpu year instruction data address clock speed L1 L2 circuit size transistors power notes
Intel 4004 1971 4-bit 4-bit 92.5 kHz 10 µm 2300 0.5W
Intel 8080 1974 8-bit 8-bit 16-bit 2 MHz 6 µm 6000 0.8W
Zilog Z80 1976 8-bit 8-bit 16-bit 4 MHz 8500 0.8W
Intel 8086 1978 x86-16 16-bit 20-bit 5-10 MHz 3 µm 29k 1.7-1.8W
Intel 80286 1982 x86-16 16-bit 24-bit 6-25 MHz 1.5 µm 134k 3.3W
Intel 80386 1985 x86-32 32-bit 32-bit 12-40 MHz 1-1.5 µm 275k 1.3-2.0W SX: 16-bit data bus, 24-bit address bus
Intel 80486 1989 x86-32, x87 32-bit 32-bit 16-100 MHz 8KB 0.6-1.0 µm 1.2m 3-4W SX has no x87 fpu
Pentium 1993 x86-32, x87 64-bit 32-bit 60-200 MHz 8+8KB 0.35-0.8 µm 3.1m 4-15W
Pentium MMX 1996 x86-32, x87 64-bit 32-bit 166-266 MHz 16+16KB 0.35 µm 4.5m 13-17W
Pentium II 1997 x86-32, x87 64-bit 36-bit 233-450 MHz 16+16KB 512KB 0.25-0.35 µm 7.5m 17-43W
Pentium III 1999 x86-32, x87, sse 64-bit 36-bit 400-1400 MHz 16+16KB 512KB 0.13-0.25 µm 9.5-28.1m 16-42W
Pentium 4 2000 x86-32, x87, sse2 64-bit 36-bit 1.3-3.8 GHz 16+16KB 512KB 65-180 nm 42-55m 50-115W later versions contained x86-64
Core 2 2006 x86-64, x87, ssse3 1.8-3.5 GHz 2-6MB 45-65 nm 65-130W duo: Conroe 65nm, Allendale 65nm, Wolfdale 45nm
quad: Kentsfield 65nm, Yorkfield 45nm
Core i5 / i7 Nehalem 2008 x86-64, x87, sse4 1.73-3.46 GHz 32+32KB 256KB 32-45 nm 82-130W dual: Clarkdale
quad: Lynnfield, Bloomfield
Core i5 / i7 Sandy Bridge 2011 x86-64, x87, sse4, avx 1.73-3.46 GHz 32+32KB 256KB 32-45 nm
Haswell 2013 x86-64, x87, sse4, avx2 14-22 nm
Skylake 10-14 nm
fpu year data address clock speed L1 cache circuit size transistors power
Intel 8087 1980 16-bit 16-bit 5-10 MHz 3 µm 45k 2.4 W
Intel 80287 1983 16-bit 16-bit 6-20 MHz
Intel 80387 1986 32-bit 32-bit 16-40 MHz 8KB 1.5 µm 120k
References:

CPU architecture drawings

Intel 4004 Intel 8080 Intel 8086 Intel 80286 Intel 80386
Intel 4004 Intel 8080 Intel 80286 Intel 80386
Intel 8087 Intel 80287 Intel 80387
Intel 8087 Intel 80387
Intel 80486 Pentium Pentium MMX Pentium 2 Pentium 3
Intel 80486 Pentium Pentium MMX
Pentium 4 Core Nehalem Sandy Bridge
core architecture nehalem architecture