An interactive guide to x86-64 assembly - introduction
It’s often said that assembly language is complex. Most people are scared of it, everyone avoids it.
After all, there’s a reason why high-level languages and compilers were invented, right?
But while it’s true that you would have a hard time writing a large project in assembly,
the language itself is surprisingly simple.
That’s because assembly is the native language of the processor,
and at its essence, all the processor does is move data.
This guide is not about writing assembly; it’s about understanding the way data moves behind the scenes when you execute a program. We’ll use concrete examples for the x86-64 architecture, but this information applies everywhere and is fundamental knowledge for reverse engineering, binary exploitation, or just writing better code.
This is the first part of a series of interactive articles:
- introduction (you are here)
- moving data
- stack frames
what is data?
Data is just bits, representing information. A sequence of bits can encode any kind of information, but this article will focus only on text and integers.
Before we talk about any kind of encoding, though, we have to introduce a new notation.
The issue is that while circuits understand sequences of bits very well, humans don’t.
For example, can you tell the difference between
1101010101111110
and 1101010101111110?
Ok, the two sequences are identical, but I bet you couldn’t immediately see that.
In order to visualize binary data in a more human-friendly way, we use
hexadecimal numbers, which represent each group of 4 bits with a single symbol:
a digit between 0 and 9 or a letter between a and f.
A long sequence of bits can be represented in this way:
0010 0101 0111 1101 1111
2 5 7 d f
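The grouping above can be sketched in a few lines of Python. This is only an illustration: `bits_to_hex` is a hypothetical helper name, and it assumes the input is a string made of `0`, `1`, and optional spaces.

```python
def bits_to_hex(bits: str) -> str:
    """Translate a string of 0s and 1s into hex, one digit per 4-bit group."""
    bits = bits.replace(" ", "")
    # Pad on the left with zeros so the length is a multiple of 4.
    bits = bits.zfill((len(bits) + 3) // 4 * 4)
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    # int(group, 2) reads the group as a base-2 number;
    # format(..., "x") writes it back as a single hex digit.
    return "".join(format(int(group, 2), "x") for group in groups)


print(bits_to_hex("0010 0101 0111 1101 1111"))  # 257df
```

Each group of 4 bits is converted on its own, which is why hexadecimal is so convenient for reading binary data: you never need to look at more than 4 bits at a time.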
Note that in order to avoid confusion with decimal numbers, it’s common to prefix hexadecimal numbers with 0x. For example, 0x1234 is not the same thing as the decimal number 1234.
I’m not going to explain how conversions between decimal, binary, and hexadecimal numbers work;
the only assumption I’m making in this article is that you already know how to do them.
If you have a Python terminal, you can perform these conversions very easily:
al@thinkpad:~/$ python
>>>
>>> 0b0010 #print the binary number 0010 in decimal
2
>>> 0x1234 #print the hex number 1234 in decimal
4660
>>> hex(0b00100101011111011111) #print a binary number in hex
'0x257df'
>>> hex(4660) #print a decimal number in hex
'0x1234'
One more thing: we call a group of 8 bits a byte, but that’s not the only group of bits with a name. The following table contains all the names that you will encounter while working with the x86-64 architecture:
N. of bits | example hex value | name |
---|---|---|
4 | f | nibble |
8 | ff | byte |
16 | ffff | word |
32 | ffffffff | dword (double word) |
64 | ffffffffffffffff | qword (quadruple word) |
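As a quick sanity check on the table, the largest unsigned value for each size can be computed rather than typed by hand. This is a minimal sketch; `SIZES` and `max_value` are names made up for this example.

```python
# Name and width in bits of each unit from the table.
SIZES = {"nibble": 4, "byte": 8, "word": 16, "dword": 32, "qword": 64}


def max_value(bits: int) -> int:
    """Largest unsigned value that fits in the given number of bits."""
    return (1 << bits) - 1


for name, bits in SIZES.items():
    # "#x" formats the number in hex with the 0x prefix.
    print(f"{name:>6}: {bits:2} bits, max = {max_value(bits):#x}")
```

Every 4 bits add one hex digit, so a dword takes 8 hex digits and a qword takes 16.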
text
There are a lot of different ways to encode text, and I recommend that you read up on the bare minimum fundamentals; it’s a very interesting topic in itself. In this article, however, we’ll focus only on ASCII encoding, which is extremely simple.
All you need to know is that text is stored as a sequence of bytes, and every byte represents a character.
ASCII uses only 7 of the 8 bits in each byte, so there are 128
possible characters (codes 0 to 127), covering digits, English letters, punctuation, and control codes.
You can find a table of all the ASCII characters in the Linux man pages.
For example, the letter ‘c’ is stored as the byte 0x63, the letter ‘o’ is 0x6f, and the text ciao is stored as the sequence of bytes 63 69 61 6f.
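You can check this encoding yourself in Python. The snippet below is a small sketch that uses only the built-in `str.encode`, `bytes.hex`, and `bytes.fromhex` (the separator argument to `hex` needs Python 3.8 or newer).

```python
text = "ciao"

# str -> bytes: with ASCII, each character becomes exactly one byte.
encoded = text.encode("ascii")
print(encoded.hex(" "))              # 63 69 61 6f

# ord() gives the numeric code of a single character.
print(hex(ord("c")), hex(ord("o")))  # 0x63 0x6f

# And back again: bytes -> str.
print(bytes.fromhex("63 69 61 6f").decode("ascii"))  # ciao
```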
where is data?
Now that we know how to represent text and numbers, we need some place to store them. Like all kinds of data, they can be stored in only two places:
- in memory, which means in your RAM
- in registers, which are special containers inside your CPU
memory
Memory is just a very long list of contiguous cells, each containing 8 bits of information, and reachable by a numeric address.
Since printing a long list of bytes would take a lot of space, when visualizing memory we usually group bytes in rows of 8 or 16. It’s also common to include a column to the side that shows the ASCII character associated with each byte.
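To make the layout concrete, here is a minimal sketch of a hex dump formatter in Python. The `hexdump` function, the `sample` bytes, and the starting address `0x1000` are all made up for illustration; real tools differ in the details of their output.

```python
def hexdump(data: bytes, base_address: int = 0, width: int = 16) -> str:
    """Format bytes as rows of: address, hex bytes, printable ASCII."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Bytes outside the printable ASCII range are shown as a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b < 0x7f else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on a short last row.
        rows.append(
            f"{base_address + offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}"
        )
    return "\n".join(rows)


# Made-up bytes for illustration, not a dump of a real program.
sample = b"ciao\x00\x01\x02\x03 hello, world\n"
print(hexdump(sample, base_address=0x1000, width=8))
```

Changing `width` changes only how the bytes are displayed; the bytes and their addresses stay the same.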
The memory dump below was taken from a program that was running on my computer. Use the slider to adjust the number of bytes you want to show in a row.
[interactive memory dump, showing 1 byte per row]
registers
Registers are containers for data, located inside your CPU. The x86-64 architecture has a lot of registers, each with an associated name. Some of them have a specific purpose; others are general-purpose containers we can use in our program. We mostly interact with these:
[register tables: byte numbers across the top, one register and its sub-registers per row]
In order to understand these tables, we’ll look at the register rax, displayed in the first row.
rax is a general-purpose register that contains 8 bytes of data: from byte 0 to byte 7, as indicated by the byte numbers at the top of the table.
The register eax gives you access to the lower 4 bytes of rax; reading eax is the same as reading bytes 0 to 3 of rax. Writing is slightly different on x86-64: a write to eax also clears bytes 4 to 7 of rax.
Similarly, ax gives you access to the lower 2 bytes, and al to the lowest byte; writing to ax or al leaves the other bytes of rax unchanged.
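The relationship between rax and its smaller views can be modeled with bit masks. The sketch below uses plain Python integers as stand-ins for registers; the value in `rax` is arbitrary and `write_al` is a hypothetical helper, not something the CPU or any library provides.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keeps a Python int within 64 bits

rax = 0x1122334455667788  # arbitrary example value

# Reading a sub-register is the same as masking off the lower bytes.
eax = rax & 0xFFFFFFFF  # bytes 0 to 3
ax = rax & 0xFFFF       # bytes 0 to 1
al = rax & 0xFF         # byte 0

print(hex(eax), hex(ax), hex(al))  # 0x55667788 0x7788 0x88


def write_al(rax: int, value: int) -> int:
    """Model a write to al: only byte 0 of rax changes."""
    return (rax & ~0xFF & MASK64) | (value & 0xFF)


print(hex(write_al(rax, 0xAA)))  # 0x11223344556677aa
```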
Finally, some code
We are assuming that you are familiar with some programming language; it doesn’t matter which one. Assembly code is built on a concept you already know: a sequence of instructions, usually one per line, that will be executed in order.
The x86-64 assembly syntax has two different dialects: AT&T and Intel. All the code snippets in this series of articles use the Intel syntax. The following snippet is an example of what the syntax looks like; don’t worry about what it does for now.
# this is a comment
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, 0x42
pop rbp
ret
A good way to familiarize yourself with the syntax is to look at the assembly generated from small snippets of code. The compiler explorer website is designed exactly for this use case: you can type snippets of code in any compiled language you know, and observe the generated assembly. If you hover the mouse over an assembly instruction you can even see a description of what it does.
In the next article we are going to see in detail how each of the instructions in the previous example works.
Further Reading
This article is still under development, and it’s improving over time.
If you reached this point, you might be interested in the next articles:
- introduction (you are here)
- moving data
- stack frames
Additional resources:
- pwn.college’s assembly module and lectures https://pwn.college/fundamentals/assembly-crash-course
- the compiler explorer website https://godbolt.org/z/c6brc1df9
- the official x86_64 reference
- unofficial x86_64 instructions reference https://www.felixcloutier.com/x86/
- The best linux syscall table reference https://syscalls.mebeim.net/?table=x86/64/x64/latest