Alberto Ventafridda
Written on

An interactive guide to x86-64 assembly - introduction

It’s often said that assembly language is complex. Most people are scared of it, and almost everyone avoids it. After all, there’s a reason why high-level languages and compilers were invented, right?
But while it’s true that you would have a hard time writing a large project in assembly, the language itself is surprisingly simple. That’s because assembly is the native language of the processor and, at its essence, all the processor does is move data.

This guide is not about writing assembly; it’s about understanding the way data moves behind the scenes when you execute a program. We’ll use concrete examples for the x86-64 architecture, but this information applies everywhere and is fundamental knowledge for reverse engineering, binary exploitation, or just writing better code.

This is the first part of a series of interactive articles:

what is data?

Data is just bits representing information. A sequence of bits can encode any kind of information, but this article focuses only on text and integers.

But before we talk about any kind of encoding, we have to introduce a new notation. The issue is that while circuits handle sequences of bits very well, humans don’t. For example, can you tell the difference between 1101010101111110 and 1101010101111110?

Ok, the two sequences are identical, but I bet you couldn’t immediately see that.

In order to visualize binary data in a more human-friendly way, we use hexadecimal numbers, which associate a digit from 0 to 9 or a letter from A to F with each group of 4 bits.
A long sequence of bits can be represented in this way:

0010 0101 0111 1101 1111

 2    5    7    d    f

Note that in order to avoid confusion with decimal numbers, it’s common to prefix hexadecimal numbers with 0x. For example, 0x1234 is not the same thing as the decimal number 1234.
I’m not going to explain how conversions between decimal, binary, and hexadecimal numbers work; the only assumption I’m making in this article is that you already know that.
If you have a python terminal, you can perform these conversions very easily:

al@thinkpad:~/$ python
>>>
>>> 0b0010 #print the binary number 0010 in decimal
2
>>> 0x1234 #print the hex number 1234 in decimal
4660
>>> hex(0b00100101011111011111) #print a binary number in hex
'0x257df'
>>> hex(4660) #print a decimal number in hex
'0x1234'
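
The nibble-by-nibble mapping shown earlier can also be sketched in a few lines of Python. This is only an illustration: the helper name `bits_to_hex` is made up for this example and is not part of any library.

```python
def bits_to_hex(bits: str) -> str:
    """Convert a string of 0s and 1s to hex, one nibble (4 bits) at a time."""
    bits = bits.replace(" ", "")
    # pad on the left so the length is a multiple of 4
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    # each group of 4 bits maps to exactly one hex digit
    return "".join(format(int(nibble, 2), "x") for nibble in nibbles)


print(bits_to_hex("0010 0101 0111 1101 1111"))  # 257df
```

Because every hex digit corresponds to exactly four bits, the conversion never has to look at more than one nibble at a time, which is why hexadecimal is so convenient for reading binary data.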

One more thing: we call a group of 8 bits a byte, but that’s not the only group of bits with a name. The following table contains all the names that you will encounter while working with the x86-64 architecture:

N. of bits   example hex value   name
4            f                   nibble
8            ff                  byte
16           ffff                word
32           ffffffff            dword (double word)
64           ffffffffffffffff    qword (quadruple word)
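
As a quick check of the table, here is a minimal Python sketch that prints the largest unsigned value fitting in each size. The names `SIZES` and `max_value` are chosen for this example only.

```python
# name -> size in bits, as listed in the table
SIZES = {"nibble": 4, "byte": 8, "word": 16, "dword": 32, "qword": 64}


def max_value(bits: int) -> int:
    """Largest unsigned value that fits in the given number of bits."""
    return (1 << bits) - 1


for name, bits in SIZES.items():
    # every 4 bits add one hex digit, so a qword needs 16 of them
    print(f"{name:6} {bits:2} bits  max = {max_value(bits):#x}")
```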

text

There are a lot of different ways to encode text, and I recommend that you read up on the bare minimum fundamentals; it’s a very interesting topic in itself. In this article, however, we’ll only focus on ASCII encoding, which is extremely simple:

All you need to know is that text is stored as a sequence of bytes, and every byte represents one character. ASCII only uses the values from 0 to 127, so there are 128 possible characters, covering digits, English letters, punctuation, and a few control codes. You can find a table of all the ASCII characters in the Linux man pages (man ascii).

For example, the letter ‘c’ is stored as the byte 0x63 and the letter ‘o’ as 0x6f. The text ciao is stored as the sequence of bytes 63 69 61 6f.
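
If you want to check this yourself, the following Python sketch encodes the same text and prints its bytes. It only uses built-in functions; the separator argument of `bytes.hex` assumes Python 3.8 or newer.

```python
text = "ciao"
data = text.encode("ascii")      # str -> bytes, one byte per character

print(data.hex(" "))             # 63 69 61 6f
print(data.decode("ascii"))      # ciao
print(hex(ord("c")), chr(0x6f))  # 0x63 o
```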

where is data?

Now that we know how to represent text and numbers, we need some place to store them. Like all kinds of data, they can be stored in only two places:

  • in memory, which means in your RAM
  • in registers, which are special containers inside your CPU

memory

Memory is just a very long list of contiguous cells, each containing 8 bits of information, and reachable by a numeric address.

Since printing a long list of bytes would take a lot of space, when visualizing memory we usually group bytes in rows of 8 or 16. It’s also common to include a column to the side that shows the ASCII character associated with each byte.

The memory dump below was taken from a program that was running on my computer. It is shown here with 16 bytes per row: the address of the first byte on the left, the bytes in hexadecimal in the middle, and the corresponding ASCII characters on the right, with a dot for every byte that is not a printable character.

00000000  65 78 61 6d 70 6c 65 20 61 73 63 69 69 20 74 65  example ascii te
00000010  78 74 00 00 00 00 00 00 e9 51 55 55 55 55 00 00  xt.......QUUUU..
00000020  40 dc ff ff 01 00 00 00 58 dc ff ff ff 7f 00 00  @.......X.......
00000030  00 00 00 00 00 00 00 00 e8 04 be 12 78 e9 6f e0  ............x.o.
00000040  58 dc ff ff ff 7f 00 00 e9 51 55 55 55 55 00 00  X........QUUUU..
00000050  98 7d 55 55 55 55 00 00 40 d0 ff f7 ff 7f 00 00  .}UUUU..@.......
00000060  e8 04 1c a4 87 16 90 1f e8 04 34 28 fd 06 90 1f  ..........4(....
00000070  00 00 00 00 ff 7f 00 00 00 00 00 00 00 00 00 00  ................
00000080  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000090  00 00 00 00 00 00 00 00 00 42 9e 87 5d ca 2f 7e  .........B..]./~
000000a0  00 00 00 00 00 00 00 00 40 9e c2 f7 ff 7f 00 00  ........@.......
000000b0  68 dc ff ff ff 7f 00 00 98 7d 55 55 55 55 00 00  h........}UUUU..
000000c0  e0 e2 ff f7 ff 7f 00 00 00 00 00 00 00 00 00 00  ................
000000d0  00 00 00 00 00 00 00 00 00 51 55 55 55 55 00 00  .........QUUUU..
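
To make the layout of such a dump concrete, here is a rough Python sketch that formats a sequence of bytes as rows of address, hex bytes, and ASCII. The function name `hexdump` and its `width` parameter are assumptions of this example, and `sample` holds only the first 32 bytes of the dump shown earlier.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: address, hex bytes, printable ASCII."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # bytes outside the printable ASCII range are shown as '.'
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(rows)


# first 32 bytes of the dump shown earlier
sample = bytes.fromhex(
    "6578616d706c65206173636969207465"
    "7874000000000000e951555555550000"
)

print(hexdump(sample))            # two rows of 16 bytes
print(hexdump(sample, width=8))   # four rows of 8 bytes
print(hex(sample[0x08]))          # the byte stored at address 8
```

Changing `width` plays the role of choosing how many bytes to show per row; the bytes themselves and their addresses stay the same.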

registers

Registers are containers for data, located inside your CPU. The x86-64 architecture has a lot of registers, each with an associated name. Some of them have a specific purpose; others are general-purpose containers we can use in our programs. We mostly interact with these:

bytes 07-00   bytes 03-00   bytes 01-00   byte 01   byte 00
rax           eax           ax            ah        al
rbx           ebx           bx            bh        bl
rcx           ecx           cx            ch        cl
rdx           edx           dx            dh        dl
rsi           esi           si                      sil
rdi           edi           di                      dil
rsp           esp           sp                      spl
rbp           ebp           bp                      bpl

bytes 07-00   bytes 03-00   bytes 01-00   byte 00
r8            r8d           r8w           r8b
r9            r9d           r9w           r9b
r10           r10d          r10w          r10b
r11           r11d          r11w          r11b
r12           r12d          r12w          r12b
r13           r13d          r13w          r13b
r14           r14d          r14w          r14b
r15           r15d          r15w          r15b

In order to understand these tables, we’ll look at the register rax, displayed in the first row. rax is a generic register that contains 8 bytes of data: from byte 0 to byte 7 as indicated by the byte numbers at the top of the table.

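The register eax gives you access to the lower 4 bytes of rax: reading eax is the same as reading bytes 0 to 3 of rax. Writing works the same way, with one x86-64 quirk: a write to eax also clears the upper 4 bytes of rax.
Similarly, ax gives you access to the lower 2 bytes, al to the lowest byte (byte 0), and ah to byte 1. Writes to these smaller registers leave the remaining bytes of rax untouched.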
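
The relationship between rax and its smaller views can be modeled with a few bit masks. The sketch below is a toy model written for this article, not how a CPU is implemented: the class name `Rax` is made up, and only reads plus writes to ax and eax are modeled. Note that on x86-64 a write to eax also zeroes the upper 4 bytes of rax, while a write to ax leaves the other bytes alone.

```python
class Rax:
    """Toy model of rax and the smaller registers that alias its low bytes."""

    def __init__(self, value: int = 0):
        self.rax = value & 0xFFFFFFFFFFFFFFFF   # bytes 0-7

    @property
    def eax(self) -> int:                        # bytes 0-3
        return self.rax & 0xFFFFFFFF

    @eax.setter
    def eax(self, value: int) -> None:
        # a 32-bit write zeroes bytes 4-7 of the full register
        self.rax = value & 0xFFFFFFFF

    @property
    def ax(self) -> int:                         # bytes 0-1
        return self.rax & 0xFFFF

    @ax.setter
    def ax(self, value: int) -> None:
        # a 16-bit write keeps bytes 2-7 unchanged
        self.rax = (self.rax & 0xFFFFFFFFFFFF0000) | (value & 0xFFFF)

    @property
    def ah(self) -> int:                         # byte 1
        return (self.rax >> 8) & 0xFF

    @property
    def al(self) -> int:                         # byte 0
        return self.rax & 0xFF


reg = Rax(0x1122334455667788)
print(hex(reg.eax))              # 0x55667788
print(hex(reg.ax))               # 0x7788
print(hex(reg.ah), hex(reg.al))  # 0x77 0x88

reg.ax = 0xBEEF
print(hex(reg.rax))              # 0x112233445566beef

reg.eax = 0x1
print(hex(reg.rax))              # 0x1
```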

Finally, some code

We are assuming that you are familiar with some programming language; it doesn’t matter which one. Assembly code follows the same basic concept you already know: a sequence of instructions, usually one per line, that are executed in order.

The x86-64 assembly syntax has two different dialects: AT&T and Intel. All the code snippets in this series of articles use the Intel syntax. The following snippet is an example of what the syntax looks like; don’t worry about what it does for now.

# this is a comment
push    rbp
mov     rbp, rsp
mov     DWORD PTR [rbp-4], edi
mov     eax, DWORD PTR [rbp-4]
add     eax, 0x42
pop     rbp
ret

A good way to familiarize yourself with the syntax is to look at the assembly generated from small snippets of code. The Compiler Explorer website is designed exactly for this use case: you can type snippets of code in any compiled language you know and observe the generated assembly. If you hover the mouse over an assembly instruction, you can even see a description of what it does.

In the next article we are going to see in detail how each of the instructions in the previous example works.

Further Reading

This article is still under development, and it’s improving over time.
If you reached this point, you might be interested in the next articles:

Additional resources: