Multithreading in C++: Memory Ordering

By Ramtin Tajbakhsh

Memory Model and Enforced Ordering

Both the compiler and the CPU can reorder memory accesses, meaning that they can happen in a different order from what's specified in the code. These reorderings happen for optimization purposes, and we don't really have to care about them in single-threaded applications. However, they can become problematic in multithreaded applications where multiple threads may need to read from and write to shared memory simultaneously.

There are three different memory ordering models, each offering a different level of restrictions and rules. They determine the degree of freedom that the CPU and the compiler have to reorder the operations in one thread, and to propagate changes between multiple threads. Here is a brief description of each model:

  • Sequentially Consistent: This is the most strict and straightforward ordering model. It guarantees that all the threads will agree on the same order of events, meaning that they will establish a single total modification order of all atomic operations that are tagged with memory_order_seq_cst.
  • Relaxed: This ordering poses no constraints on the CPU and the compiler. The only guarantee with this ordering is that the operations will be carried out atomically. This ordering applies to operations tagged with memory_order_relaxed.
  • Acquire/Release: If the result of a release atomic store (atomicVar.store(value, std::memory_order_release)) is read by an acquire atomic load (atomicVar.load(std::memory_order_acquire)) in another thread, the second thread is guaranteed to see everything that happened before the atomic store. This ordering model is a middle ground between the other two in terms of constraints and it has better performance than Sequentially Consistent ordering.

Before diving deep into each of these models, let's go over some formal definitions and examples.

Sequenced-before

When evaluation A happens before evaluation B in the same thread, A is sequenced-before B.

// A is sequenced-before B
int num = 10; // A
num = 30; // B

Synchronizes-with

// Thread 1
atomicFlag.store(true, std::memory_order_release);

// Thread 2
while (!atomicFlag.load(std::memory_order_acquire));

An atomic store tagged with memory_order_release synchronizes-with an atomic load tagged with memory_order_acquire in another thread if the load reads the value written by that store. In this example, the atomic store in Thread 1 synchronizes-with the atomic load in Thread 2. The use case of this will become clear when we go over the Acquire/Release ordering in more detail.

Inter-thread happens-before

Evaluation A inter-thread happens-before B in the following situations:

  • A synchronizes-with B
// Thread 1
atomicNum.store(20, std::memory_order_release); // A

// Thread 2
while (atomicNum.load(std::memory_order_acquire) != 20); // B
  • A synchronizes-with X and X is sequenced-before B
// Thread 1
atomicFlag.store(true, std::memory_order_release); // A

// Thread 2
while (!atomicFlag.load(std::memory_order_acquire)); // X
std::cout << "Hello"; // B
  • A is sequenced-before X, and X inter-thread happens-before B
// Thread 1
atomicNum.store(20, std::memory_order_relaxed); // A
atomicFlag.store(true, std::memory_order_release); // X

// Thread 2
while (!atomicFlag.load(std::memory_order_acquire));
int value = atomicNum.load(std::memory_order_relaxed); // B
  • A inter-thread happens-before X, and X inter-thread happens-before B (transitive property).
// Thread 1
atomicNum.store(20, std::memory_order_relaxed); // A
flag1.store(true, std::memory_order_release);

// Thread 2
while (!flag1.load(std::memory_order_acquire)); // X
flag2.store(true, std::memory_order_release);

// Thread 3
while (!flag2.load(std::memory_order_acquire));
int value = atomicNum.load(std::memory_order_relaxed); // B

Happens-before

We say A happens-before B if either A is sequenced-before B or A inter-thread happens-before B.

Sequencing Models

Sequentially Consistent Ordering

In the absence of any explicit ordering, memory_order_seq_cst is the default. The sequentially consistent ordering is the most restrictive of all the options, and it's the most intuitive one. It guarantees that all the threads will agree on the same sequence of events between the atomic operations, and that each atomic store synchronizes-with every subsequent atomic load that reads the stored value.

There is only one downside to this approach: the guarantee of a single global ordering comes at the cost of performance. In multicore systems, different cores may need to execute additional expensive synchronization instructions in order to guarantee sequential consistency between each other. One such instruction is XCHG, which is discussed later in this article.

Let's go over an example of what sequentially consistent ordering means:

// flag initially holds false

// Thread 1
atomicNumber.store(20, std::memory_order_seq_cst); // 1
flag.store(true, std::memory_order_seq_cst); // 2

// Thread 2
while (!flag.load(std::memory_order_seq_cst)); // 3
assert(atomicNumber.load() == 20); // 4

In this example, the assertion on line 4 will never fail. The atomic store on line 2 in Thread 1 synchronizes-with the atomic load on line 3 in Thread 2, and within Thread 1, the atomic store to the atomicNumber variable happens-before the atomic store to the flag. Due to the transitive property of the happens-before relationship, line 1 happens-before line 4, and therefore the assertion will always succeed.

This example is very intuitive and matches what we would expect to happen. This is exactly why memory_order_seq_cst is the default mode for all atomic operations. Unless we have a good reason not to, we can usually stick with this memory ordering model. As we will see in a bit, the assertion above could fail if we used memory_order_relaxed instead.
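
To make the total-order guarantee concrete, here is a runnable sketch of the classic four-thread litmus test (the variable and function names are illustrative, not from a specific library). Because every operation is tagged with memory_order_seq_cst, all threads agree on a single order for the stores to x and y, so at least one of the two reader threads must observe both flags as true:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst));    // wait until x is set
    if (y.load(std::memory_order_seq_cst)) ++z;    // did y happen first?
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst));    // wait until y is set
    if (x.load(std::memory_order_seq_cst)) ++z;    // did x happen first?
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // never fires: both readers can't miss both stores
}

If the readers could disagree on which store came first, both could increment nothing and z would be 0. Sequential consistency rules that outcome out, which is exactly the single total modification order described above.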

Relaxed Ordering

memory_order_relaxed is the opposite side of the coin. Almost no guarantee is made on the ordering of atomic operations tagged with this model. Another interesting consequence of relaxed ordering is that different threads may not agree on the same order of events, so two threads may see two different values for the same atomic variable at the exact same time.

Let's see an example of what's totally valid but unexpected when memory_order_relaxed is used:

// Shared
std::atomic<int> num(0);
std::atomic<bool> flag(false);

// Thread 1
num.store(10, std::memory_order_relaxed);
num.store(20, std::memory_order_relaxed);
num.store(30, std::memory_order_relaxed);
num.store(40, std::memory_order_relaxed);
num.store(50, std::memory_order_relaxed);
flag.store(true, std::memory_order_relaxed);

// Thread 2
while (!flag.load(std::memory_order_relaxed));
int value = num.load(std::memory_order_relaxed);

In Thread 2, value can be 10, 20, 30, 40, or 50. All of these are valid outcomes because we used memory_order_relaxed and there is no guarantee that both threads will see the same value for the atomic variable num. Some rules still apply, even with relaxed ordering. For instance, let's assume that thread 2 sees the value 30 the first time it calls num.load(std::memory_order_relaxed). In this situation, the subsequent calls to num.load(std::memory_order_relaxed) in thread 2 cannot see any of the previous atomic stores on num, so they will either still see 30, or they will see the subsequent atomic stores in thread 1 which are 40 and 50.

The benefit of using relaxed ordering is that it has the best performance, but it's also the trickiest to get right. Due to the looseness of the rules enforced by memory_order_relaxed, it shouldn't be used unless performance is of utmost importance. A common application of memory_order_relaxed is incrementing a counter, because we don't really care about the order in which the counter gets incremented, as long as it happens eventually and atomically, as sketched below.
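
Here is a minimal sketch of that counter pattern (names are illustrative). Each increment only needs to be atomic; no thread ever relies on the order in which the increments become visible, so relaxed ordering is sufficient:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter(0);

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([] {
            for (int j = 0; j < 10000; ++j)
                // Atomicity is all we need; visibility order doesn't matter.
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : threads) t.join();
    std::cout << counter.load() << '\n'; // always prints 40000
}

No increment can be lost because each fetch_add is atomic, so the final value is always 40000, even though the intermediate values observed by different threads may differ.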

Acquire/Release Ordering

The Acquire/Release ordering model is a middle ground between the sequentially consistent and relaxed models. It has better performance than sequentially consistent, and it provides useful synchronization guarantees in multithreaded programs.

If an atomic load tagged with memory_order_acquire in Thread B reads a value written to the same atomic variable by an atomic store tagged with memory_order_release in Thread A, then the store in Thread A synchronizes-with the load in Thread B. This means that all atomic and non-atomic memory accesses before the release store are visible after the acquire load. In terms of the definitions we had earlier, every memory access that appears before the store tagged with memory_order_release happens-before the instructions that appear after the load tagged with memory_order_acquire.

This synchronization is only established between the threads releasing and acquiring the same atomic variable. Memory accesses may still propagate in a different order to other threads that don't perform an acquire operation on the same atomic variable.

// Thread 1
nonAtomicNum = 20;
atomicNum.store(40, std::memory_order_relaxed);
flag.store(true, std::memory_order_release); // A

// Thread 2
while (!flag.load(std::memory_order_acquire)); // B
assert(nonAtomicNum == 20); // Never fires
assert(atomicNum.load(std::memory_order_relaxed) == 40); // Never fires

In the example above, Thread 1 synchronizes-with Thread 2 on flag, which is an atomic boolean. Therefore, all the writes before the store tagged with memory_order_release are going to be visible in Thread 2 after the synchronization. As a result, both assertions are guaranteed to always succeed.
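
To see what acquire/release does not guarantee, here is the four-thread litmus test from the sequentially consistent section, downgraded to release stores and acquire loads (again a sketch, with illustrative names). Since there is no longer a single total order, the two reader threads may disagree on whether x or y was written first, and z == 0 becomes a legal outcome (observable in practice on weakly ordered hardware such as ARM):

#include <atomic>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true, std::memory_order_release); }
void write_y() { y.store(true, std::memory_order_release); }

void read_x_then_y() {
    while (!x.load(std::memory_order_acquire));
    if (y.load(std::memory_order_acquire)) ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_acquire));
    if (x.load(std::memory_order_acquire)) ++z;
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    // z may be 0, 1, or 2: the readers synchronize with different writers
    // and need not agree on the order of the two stores.
}

The synchronization here is strictly pairwise, between the thread that releases an atomic variable and the thread that acquires it, which is exactly the limitation described above.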

Behind the Scenes

x86 vs ARM

In strongly ordered architectures like x86, release and acquire semantics are the minimum constraints the CPU already imposes on the ordering of ordinary loads and stores. In these architectures, no additional instruction is generated for memory_order_release and memory_order_acquire compared to memory_order_relaxed.

atomicNum.store(10, std::memory_order_release); // A
atomicNum.store(10, std::memory_order_relaxed); // B

Lines A and B in the example above produce the exact same assembly code, because on x86 a plain mov store already has release semantics (compiled with x86 GCC 13.2):

mov DWORD PTR atomicNum[rip], 10

In weakly ordered systems like ARM, however, the CPU may reorder operations that are tagged with memory_order_release, so additional instructions are required to enforce release-acquire ordering. The relaxed store on Line B generates the following ARM assembly (compiled with ARM GCC 13.2.0):

; counter.store(10, std::memory_order_relaxed)
movs r0, #10

whereas the release store on Line A generates the following:

; counter.store(10, std::memory_order_release)
dmb ish
movs r2, #10 

dmb stands for Data Memory Barrier, which according to the ARM documentation "ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction". This is exactly the behavior we would expect from memory_order_release. For operations that are tagged with memory_order_acquire, the data memory barrier (dmb) is placed after the main instruction, and with memory_order_seq_cst, the main instruction is wrapped between two data memory barriers:

; counter.store(10, std::memory_order_seq_cst);
dmb ish
movs r2, #10
dmb ish

To ensure sequential consistency on x86, the compiler uses either mov + mfence or mov + xchg. The mov by itself is not enough since it only has release semantics. xchg has an implicit lock prefix, which makes it a full memory barrier. I got the following assembly when compiling with x86 GCC 13.2:

; counter.store(10, std::memory_order_seq_cst)
mov eax, 10
xchg eax, DWORD PTR counter[rip]

These additional instructions explain the performance differences between the memory ordering models: the stronger the ordering guarantee, the more barrier work the CPU has to perform.
