Threads the Assembler way
For this article, I wanted to try to use threads via pthreads, but from assembler
instead of from C. Cutting to the chase, it is not only possible, but also
relatively easy, but I also made many mistakes on the way there.
Perils of Assembler programming
As you know, I've written a handful of Assembler programs for my blog at this
point. It seems to be a good point to talk briefly about the main difference or
the main challenge I met with while programming on that level.
Having written a few medium-sized C programs (including drivers running on
Windows), one might believe to know low-level programming. But writing Assembler
code clearly shows that C actually is a high-level programming language. Yes, if
you hear "high-level programming language", you might expect nice productivity
features like garbage collection or dynamic typing (or at least language support
for OOP). But if you change your point of view somewhat, you will see that C
makes a lot of stuff easy, safe and comfortable.
Yes, I just said that coding in C is safe. Again, that is relative to Assembler:
The main problem I had while writing my Assembler programs was that I wrote code
I was quite certain it would work - how could it not, as reasoning about it
seemed straight-forward? - but when running it, the output was either
"segmentation fault" or "bus error".
Segmentation faults are no stranger to any C programmer who learned from his
mistakes. They occur easily when you access a struct via a null pointer or have
a use-after-free bug in your code. Or maybe you did not respect your array
boundaries? There are many ways to provoke them.
Bus errors were new to me. According to wikipedia
they occur on unaligned memory accesses (which is only a "problem" on amd64 if
you enable checking for them, which I did not) or if you try to access memory
from an address that is wholly unrecognizable for your computer.
When you are coding Assembler programs, you get all the problems of C (well,
most of them at least. You won't have to deal with code being optimized away).
But you also get even more problems. When writing C, the compiler will yell at
you if you take an int or char** and use it as an char*. Without explicit
casting, you cannot get them wrong.
If you mistake a LEA instruction for a MOV or vice versa, you can easily make
that mistake. I make these mistakes more often than I am comfortable to admit,
and this breaks my programs and leaves me confused. Well, at least debugging
with GDB works well for Assembler programs.
That said, let's start with our program for today.
Playing with threads
What is the "hello world" of concurrency, I wondered? Ignoring that I could
start multiple threads that just print on stdout, I wanted to do something more
insightful. I settled on creating two threads that increment the same counter in
a loop. If you are familiar with the typical bugs that you can get in that
scenario, you know were I will be going with this example. If not, just follow
along.
Our standard files
I did not make a lot of changes to our standard files. But look, we are linking
to a new library in the makefile!
Makefile:
PRGNAME=asmthread OBJECTS=asmthread.o openbsd-note.o $PRGNAME: $(OBJECTS) ld --dynamic-linker=/usr/libexec/ld.so -L/usr/lib -lc -lpthread -o $(PRGNAME) \ $(OBJECTS) .s.o: as -g -o $@ $< clean: -rm *.o -rm $(PRGNAME)
openbsd-note.s:
.section ".note.openbsd.ident", "a" .p2align 2 .long 8,4,1 .ascii "OpenBSD\0" .long 0
Concurrent Assembler code, first try
.intel_syntax noprefix .text .global _start .p2align 4, 0xcc _start: mov rbp, rsp sub rsp, 16 /* enough space for two pthread variables */ lea rdi, [rsp] xor rsi, rsi lea rdx, [rip+countThread] mov rcx, 1 call pthread_create@plt lea rdi, [rsp+8] xor rsi, rsi lea rdx, [rip+countThread] mov rcx, 2 call pthread_create@plt mov rdi, [rsp] xor rsi, rsi call pthread_join@plt mov rdi, [rsp+8] xor rsi, rsi call pthread_join@plt lea rdi, [rip+.L.printCounter] mov rsi, [rip+.L.counter] call printf@plt xor rdi, rdi call exit@plt .p2align 4, 0xcc countThread: push rbp mov rbp, rsp xor rax, rax .L.loop: cmp rax, 5000000 je .L.done inc rax inc QWORD PTR [rip+.L.counter] jmp .L.loop .L.done: mov rsi, rdi /* get thread parameter */ lea rdi, [rip+.L.printThreadFinished] call printf@plt xor rax, rax mov rsp, rbp pop rbp ret .bss .L.counter: .skip 8 .section .rodata .L.printCounter: .asciz "Counter: %d\n" .L.printThreadFinished: .asciz "Thread '%d' finished\n"
We will start with anything but the actual code again. I am using the .p2align
directive here to align the _start and countThread symbols at 16-byte boundaries
and use 0xCC as a fill byte. I experimented with it while chasing some bus
errors. As far as I can tell, my program does not necessarily need this
alignment, but I kept it in as correctly aligning your functions is a good habit
to get into - for performance reasons or when coding for stricter architectures
like SPARC. By the way, I think I should start also aligning my data symbols -
there is no particular reason not to do it.
Speaking of data: This program uses a single 8-byte counter variable in the .bss
section and two read-only format strings that I will use with printf. After the
last article, these are not too exciting anymore.
Looking at the countThread routine first
countThread contains the code that will run as our thread's main function. As
such, I had to write it with C's calling convention in mind. I start by creating
a stack frame for myself, saving the base pointer (rbp) on the stack and setting
it to the current stack pointer (rsp) after that. This would allow me to store
local variables on the stack, which I don't do here. It also helps debuggers
keep up with function calls for showing backtraces.
The main part of the function is a simple loop. I use rax as my counting
register, therefore I set it to zero before using it. We then enter a loop that
will run while rax is not equal to 5.000.000.
Inside the loop itself, we both increment our counting register, but also
increment the counter variable that we declared in the .bss section. At first, I
wrote that as three separate instructions: reading the counter into a register,
incrementing that register and then writing the register back to the counter.
That works, too. But I can also use the INC instruction directly with the
counter, which is more awesome! I had to tell our GNU assembler that I want it
to thread the memory access as a pointer to 8 bytes worth of memory for that,
which I did by adding "QWORD PTR" to the memory access.
After the loop is done, the countThread function prints a "finished" message and
returns to the caller. If you now pthread, you know that each thread's main
function gets a void pointer argument passed to it. I use this as an integer
containing a thread number which I want to print out. This argument is in the
rdi register, so I have to copy it to the rsi register as the second parameter
for printf before overwriting it with the format string's address.
The program entry point, _start
After looking at countThread, the first line of _start should be easier to
understand, but I will also admit that it is quite unnecessary for this concrete
program. I initialize the rpb register to my initial stack pointer. I could use
this to easily access my program arguments. I could also use it to access the
stack variables I will use here - accessing them from rsp hinders me from
changing the stack pointer again. If I were to write the program again, I would
most likely choose an rbp-relative notation instead, so keep in mind that both
ways work but rsp-relative might not work for every program.
As I wrote before, I want to start two threads. This means that I will need two
variables of type pthread. In C, we would just declare them as such, but here,
we need to know the size of their types. I could ask a C compiler to print out
sizeof(pthread) for me, but I opted to take a look at the header. There, I found
that a pthread is a typedef for a pointer to a struct pthread_t. Pointers are 8
bytes on amd64, so we have to reserve 16 bytes for both variables.
The next block starts the first thread by calling pthread_create. It takes four
arguments, the first parameter being the address of a pthread variable (or a
pthread* in C). The second parameter can be kept as zero for my use case. The
third parameter is the function pointer to be called, or the address of
countThread in this case. Finally, the fourth parameter is the void* argument to
the thread's function I mentioned before. As I said, I use it to pass a single
integer that will be output as the thread's number. For the first thread, I
chose to use 1.
The block after that repeats the same code for the second thread. It is mostly
the same, but uses a different address on my stack for its pthread variable and
also uses 2 as the thread number parameter.
After starting the two threads, the next two blocks wait for them to complete. I
use the pthread_join function for that.
Finally, _start just has to print out the counter's final value and exit itself.
Let's run the program:
Thread '2' finished Thread '1' finished Counter: 5106983
Is that the output you expected? Executing it again might even lead to a totally
different value!
Meet my old friend, race condition
When I described the INC instruction used by the threads, I told you that I
could have written it as three separate instructions. While I did not have to
write it that way, it still works that way: It first loads the counter, then
increments it, and then writes the incremented value back to the counter.
When multiple threads are used, they can overlap in many ways. You might have
wanted them to work in a way where the first thread does these three steps and
then the second thread does the three steps ... but it does not have to work
that way and most likely will not.
It is possible for both threads to read the same counter value. They will both
increment it to the same new value and one increment will get lost. Even worse,
something like this is also possible:
counter = 0 thread 1 | thread 2 read counter = 0 | | read counter = 0 | increment to 1 | write counter = 1 | read counter = 1 | increment to 2 | write counter = 2 increment to 1 | write counter = 1|
If both threads work together like that, the counter might even go backwards at
times.
Note that if your system as multiple cores (and most systems today will have
that), both threads might even perform steps at the same time, which will only
make the problem worse.
A solution
Race conditions like that one can be solved in many ways. We could use any
locking primitive that we would also use from C, like a pthread_mutex_t. But
that would be a very blunt tool for our problem. Another solution would be to
use atomic variables, which compile to special instructions that access our
counter without creating race conditions. Let's try that!
Replace our loop with this:
.L.loop: cmp rax, 5000000 je .L.done inc rax mov rcx, 1 lock xadd [rip+.L.counter], rcx jmp .L.loop
The XADD instruction adds a register to our counter and stores the old value in
the register. You might know it as atomic_fetch_add from the C11 stdatomic.h.
But XADD alone would still have a race condition. Only with the LOCK prefix you
get the real atomic_fetch_add.
This change fixes our little bug, so that you get always the correct value when
you run the program.
Thread '2' finished Thread '1' finished Counter: 10000000
You will also notice that running this program now takes a lot more time than
the buggy version. Both threads do nearly nothing but contend for the lock. A
simple fix for that would reduce lock contention as far as possible: Both
threads could increment their own counter and only after the loop add that to
the shared counter. That would be much faster. But then again, you could also
remove the loop and just add the final sum. Or just replace the whole program
with a single puts containing the pre-calculated sum.
I wrote this code to learn about atomic variables in Assembler programs. That is
what I did. In real programs, you will have to decide how much optimization is
called for and which optimizations to use.