Threads the Assembler way

For this article, I wanted to try to use threads via pthreads, but from Assembler instead of from C. Cutting to the chase: it is not only possible but also relatively easy, although I made many mistakes along the way.

Perils of Assembler programming

As you know, I've written a handful of Assembler programs for my blog at this point. This seems like a good moment to talk briefly about the main challenge I ran into while programming at that level.

Having written a few medium-sized C programs (including drivers running on Windows), one might believe they know low-level programming. But writing Assembler code clearly shows that C actually is a high-level programming language. Yes, when you hear "high-level programming language", you might expect nice productivity features like garbage collection or dynamic typing (or at least language support for OOP). But if you shift your point of view somewhat, you will see that C makes a lot of stuff easy, safe and comfortable.

Yes, I just said that coding in C is safe. Again, that is relative to Assembler: The main problem I had while writing my Assembler programs was that I wrote code I was quite certain would work - how could it not, as reasoning about it seemed straightforward? - but when running it, the output was either "segmentation fault" or "bus error".

Segmentation faults are no stranger to any C programmer who learned from his mistakes. They occur easily when you access a struct via a null pointer or have a use-after-free bug in your code. Or maybe you did not respect your array boundaries? There are many ways to provoke them.

Bus errors were new to me. According to Wikipedia, they occur on unaligned memory accesses (which is only a "problem" on amd64 if you enable checking for them, which I did not) or when you try to access memory through an address that is wholly unrecognizable to your computer.

Bus error on Wikipedia

When you are writing Assembler programs, you get all the problems of C (well, most of them at least - you won't have to deal with code being optimized away). But you also get even more problems. When writing C, the compiler will yell at you if you take an int or a char** and use it as a char*. Without an explicit cast, you cannot mix them up.

In Assembler, nothing stops you from using a LEA instruction where a MOV belongs, or vice versa. I make these mistakes more often than I am comfortable admitting, and they break my programs and leave me confused. Well, at least debugging Assembler programs with GDB works well.
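To make that concrete, here is a minimal sketch of the two instructions next to each other (the .L.value label is made up just for this example): MOV loads the 8 bytes stored at the label, LEA only computes the label's address. Swap the two and a pointer suddenly holds data - or data gets used as a pointer - and the next access ends in a segmentation fault or bus error.

.intel_syntax noprefix

.text
example:
    mov rax, QWORD PTR [rip+.L.value]  /* loads the value stored at .L.value into rax */
    lea rax, [rip+.L.value]            /* loads the address of .L.value into rax */
    ret

.bss
.L.value:
    .skip 8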

That said, let's start with our program for today.

Playing with threads

What is the "hello world" of concurrency, I wondered? Ignoring that I could start multiple threads that just print on stdout, I wanted to do something more insightful. I settled on creating two threads that increment the same counter in a loop. If you are familiar with the typical bugs you can run into in that scenario, you know where I am going with this example. If not, just follow along.

Our standard files

I did not make a lot of changes to our standard files. But look, we are linking against a new library in the Makefile!

Makefile:

PRGNAME=asmthread
OBJECTS=asmthread.o openbsd-note.o

$(PRGNAME): $(OBJECTS)
	ld --dynamic-linker=/usr/libexec/ld.so -L/usr/lib -lc -lpthread -o $(PRGNAME) \
		$(OBJECTS) 

.s.o:
	as -g -o $@ $<

clean:
	-rm *.o
	-rm $(PRGNAME)

openbsd-note.s:

.section ".note.openbsd.ident", "a"
        .p2align   2
        .long      8,4,1
        .ascii      "OpenBSD\0"
        .long      0

Concurrent Assembler code, first try

asmthread.s:

.intel_syntax noprefix

.text

.global _start
.p2align	4, 0xcc
_start:
    mov rbp, rsp

    sub rsp, 16 /* enough space for two pthread_t variables */

    lea rdi, [rsp]
    xor rsi, rsi
    lea rdx, [rip+countThread]
    mov rcx, 1
    call pthread_create@plt

    lea rdi, [rsp+8]
    xor rsi, rsi
    lea rdx, [rip+countThread]
    mov rcx, 2
    call pthread_create@plt

    mov rdi, [rsp]
    xor rsi, rsi
    call pthread_join@plt

    mov rdi, [rsp+8]
    xor rsi, rsi
    call pthread_join@plt

    lea rdi, [rip+.L.printCounter]
    mov rsi, [rip+.L.counter]
    call printf@plt

    xor rdi, rdi
    call exit@plt

.p2align	4, 0xcc
countThread:
    push rbp
    mov rbp, rsp

    xor rax, rax

.L.loop:
    cmp rax, 5000000
    je .L.done

    inc rax
    inc QWORD PTR [rip+.L.counter]

    jmp .L.loop

.L.done:
    mov rsi, rdi /* get thread parameter */
    lea rdi, [rip+.L.printThreadFinished]
    call printf@plt

    xor rax, rax

    mov rsp, rbp
    pop rbp
    ret

.bss
.L.counter:
    .skip 8

.section .rodata
.L.printCounter:
    .asciz "Counter: %d\n"
.L.printThreadFinished:
    .asciz "Thread '%d' finished\n"

We will again start with everything but the actual logic. I am using the .p2align directive here to align the _start and countThread symbols on 16-byte boundaries, with 0xCC as the fill byte. I experimented with it while chasing some bus errors. As far as I can tell, my program does not strictly need this alignment, but I kept it in, as correctly aligning your functions is a good habit to get into - for performance reasons, or when coding for stricter architectures like SPARC. By the way, I think I should also start aligning my data symbols - there is no particular reason not to do it.
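Just as a sketch of what I mean (the program above does not do this, and .balign is merely the directive I would probably reach for, since it takes a byte count instead of a power of two), aligning the counter could look like this:

.bss
.balign 8        /* place the counter on an 8-byte boundary */
.L.counter:
    .skip 8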

Speaking of data: This program uses a single 8-byte counter variable in the .bss section and two read-only format strings that I will use with printf. After the last article, these are not too exciting anymore.

Looking at the countThread routine first

countThread contains the code that will run as our thread's main function. As such, I had to write it with C's calling convention in mind. I start by creating a stack frame for myself, saving the base pointer (rbp) on the stack and setting it to the current stack pointer (rsp) after that. This would allow me to store local variables on the stack, which I don't do here. It also helps debuggers keep up with function calls for showing backtraces.

The main part of the function is a simple loop. I use rax as my counting register, so I set it to zero before using it. We then enter a loop that runs until rax reaches 5,000,000.

Inside the loop itself, we increment both our counting register and the counter variable that we declared in the .bss section. At first, I wrote the latter as three separate instructions: reading the counter into a register, incrementing that register and then writing the register back to the counter. That works, too. But I can also use the INC instruction directly on the counter, which is more awesome! For that, I had to tell the GNU assembler to treat the memory operand as a reference to 8 bytes worth of memory, which I did by adding "QWORD PTR" to the memory access.
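For reference, the three-instruction version would have looked roughly like this (using rcx as a scratch register is my choice here, nothing forces that):

    mov rcx, QWORD PTR [rip+.L.counter]   /* read the counter into a register */
    inc rcx                               /* increment the register */
    mov QWORD PTR [rip+.L.counter], rcx   /* write the new value back to the counter */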

After the loop is done, the countThread function prints a "finished" message and returns to the caller. If you know pthreads, you know that each thread's main function gets a void pointer argument passed to it. I use this as an integer containing a thread number, which I want to print out. This argument arrives in the rdi register, so I have to copy it to the rsi register (the second parameter for printf) before overwriting rdi with the format string's address.

The program entry point, _start

After looking at countThread, the first line of _start should be easier to understand, but I will also admit that it is quite unnecessary for this concrete program. I initialize the rbp register with my initial stack pointer. I could use it to easily access my program arguments. I could also use it to access the stack variables I use here - addressing them relative to rsp means I cannot change the stack pointer afterwards without adjusting all the offsets. If I were to write the program again, I would most likely use rbp-relative addressing instead; keep in mind that both ways work here, but rsp-relative addressing does not suit every program.

As I wrote before, I want to start two threads. This means that I will need two variables of type pthread_t. In C, we would just declare them as such, but here, we need to know the size of that type. I could ask a C compiler to print out sizeof(pthread_t) for me, but I opted to take a look at the header. There, I found that pthread_t is a typedef for a pointer to a struct pthread. Pointers are 8 bytes on amd64, so we have to reserve 16 bytes for both variables.

The next block starts the first thread by calling pthread_create. It takes four arguments, the first being the address of a pthread_t variable (a pthread_t* in C). The second parameter can stay zero for my use case. The third parameter is the function pointer to be called, the address of countThread in this case. Finally, the fourth parameter is the void* argument to the thread's function that I mentioned before. As I said, I use it to pass a single integer that will be printed as the thread's number. For the first thread, I chose 1.
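Here is the first call again, annotated with how I map the registers to the C parameters of pthread_create (its prototype is int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg)):

    lea rdi, [rsp]               /* 1st arg: pthread_t *thread - address of the first stack slot */
    xor rsi, rsi                 /* 2nd arg: const pthread_attr_t *attr - NULL for default attributes */
    lea rdx, [rip+countThread]   /* 3rd arg: start routine - address of countThread */
    mov rcx, 1                   /* 4th arg: void *arg - our thread "number" 1 */
    call pthread_create@plt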

The block after that repeats the same code for the second thread. It is mostly the same, but uses a different address on my stack for its pthread variable and also uses 2 as the thread number parameter.

After starting the two threads, the next two blocks wait for them to complete. I use the pthread_join function for that.
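pthread_join takes the pthread_t value itself (not its address) as its first argument and an optional void** for the thread's return value as its second; I pass zero there because I am not interested in the return value. If I wanted it, a sketch like the following should work - note that it reserves extra stack space (sub rsp, 32 instead of 16), which the program above does not do:

    sub rsp, 32                  /* two pthread_t slots plus room for a void *result */

    /* ... pthread_create calls as before ... */

    mov rdi, [rsp]               /* 1st arg: the pthread_t value of the first thread */
    lea rsi, [rsp+16]            /* 2nd arg: void ** that receives the thread's return value */
    call pthread_join@plt
    mov rax, [rsp+16]            /* rax now holds whatever countThread returned (0 here) */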

Finally, _start just has to print out the counter's final value and exit itself.

Let's run the program:

Thread '2' finished
Thread '1' finished
Counter: 5106983

Is that the output you expected? Executing it again might even lead to a totally different value!

Meet my old friend, race condition

When I described the INC instruction used by the threads, I told you that I could have written it as three separate instructions. While I did not write it that way, the processor still works that way: it first loads the counter, then increments it, and then writes the incremented value back to the counter.

When multiple threads run, their steps can interleave in many ways. You might have wanted the first thread to do these three steps and then the second thread to do its three steps ... but it does not have to happen that way, and most likely will not.

It is possible for both threads to read the same counter value. They will both increment it to the same new value and one increment will get lost. Even worse, something like this is also possible:

            counter = 0
thread 1         |          thread 2
read counter = 0 |
                 | read counter = 0
                 | increment to 1
                 | write counter = 1
                 | read counter = 1
                 | increment to 2
                 | write counter = 2
increment to 1   |
write counter = 1|

If both threads work together like that, the counter might even go backwards at times.

Note that if your system has multiple cores (and most systems today do), both threads might even perform their steps at the same time, which only makes the problem worse.

A solution

Race conditions like this one can be solved in many ways. We could use any locking primitive that we would also use from C, like a pthread_mutex_t. But that would be a very blunt tool for our problem. Another solution is to use atomic operations, which boil down to special instructions that access our counter without creating a race condition. Let's try that!

Replace our loop with this:

.L.loop:
    cmp rax, 5000000
    je .L.done

    inc rax
    mov rcx, 1
    lock xadd [rip+.L.counter], rcx

    jmp .L.loop

The XADD instruction adds a register to our counter and stores the old value in the register. You might know it as atomic_fetch_add from C11's stdatomic.h. But XADD alone would still have a race condition; only with the LOCK prefix do you get the real atomic_fetch_add.
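Since I only need the increment and not the old value, I believe a plain INC with the LOCK prefix would have worked here just as well; that variant would look like this:

.L.loop:
    cmp rax, 5000000
    je .L.done

    inc rax
    lock inc QWORD PTR [rip+.L.counter]   /* atomic read-modify-write of the shared counter */

    jmp .L.loop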

This change fixes our little bug, so you always get the correct value when you run the program.

Thread '2' finished
Thread '1' finished
Counter: 10000000

You will also notice that running this program now takes a lot more time than the buggy version: both threads do nearly nothing but contend for the lock. A simple fix would reduce lock contention as far as possible: both threads could increment their own local counter and add it to the shared counter only once, after the loop. That would be much faster. But then again, you could also remove the loop and just add the final sum directly. Or replace the whole program with a single puts of the pre-calculated sum.
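A sketch of that idea, assuming everything else in countThread stays the same: the loop only touches rax, and the shared counter is updated exactly once after the loop (rax holds 5000000 at that point):

.L.loop:
    cmp rax, 5000000
    je .L.done

    inc rax                      /* only the thread-local count inside the loop */

    jmp .L.loop

.L.done:
    lock xadd QWORD PTR [rip+.L.counter], rax   /* add the local total to the shared counter once */
    /* ... rest of .L.done as before ... */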

I wrote this code to learn about atomic operations in Assembler programs, and that is what I did. In real programs, you will have to decide how much optimization is called for and which optimizations to use.