I’m nowhere near as deep in this rabbit hole. But I recall running/seeing some b...

gpderetta · on June 15, 2023

At least on x86, acq/rel load/stores vs relaxed is basically free. Seq/cst loads are also free. Seq/cst are relatively fast, but at around 20-30 clock cycles still measurably slower than everything else.

The catch is that x86 only has seq/cst atomic RMW so even if you ask for, say, a relaxed CAS or XADD, you will still get an expensive one.

So the C++11 memory model allows you to more easily maintain correctness (and your sanity), but for performance you still have to know how your code maps to the underlying microarchitecture.
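
To make the mapping concrete, here is a minimal C++ sketch of the operations discussed above. The names (`hits`, `latest`, `count_relaxed`, and so on) are hypothetical, and the instruction names in the comments describe what mainstream compilers typically emit on x86-64, not a guarantee:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical globals, used only for illustration.
std::atomic<std::uint64_t> hits{0};
std::atomic<std::uint64_t> latest{0};

// Relaxed RMW. On x86 this is typically still a LOCK-prefixed instruction
// (e.g. LOCK XADD), so asking for relaxed does not make it cheaper there.
std::uint64_t count_relaxed() {
    return hits.fetch_add(1, std::memory_order_relaxed);
}

// Seq-cst RMW. Typically the same instruction as above on x86; the
// difference only shows up on weaker architectures.
std::uint64_t count_seq_cst() {
    return hits.fetch_add(1, std::memory_order_seq_cst);
}

// Release store and acquire load: typically plain MOVs on x86.
void publish(std::uint64_t v) { latest.store(v, std::memory_order_release); }
std::uint64_t observe() { return latest.load(std::memory_order_acquire); }

// Seq-cst store (the default ordering): the store that costs extra on x86.
void publish_seq_cst(std::uint64_t v) { latest.store(v); }
```

Comparing the generated assembly for these functions (for example with `-O2 -S`) is a quick way to check what your own compiler does.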

ComputerGuru · on June 16, 2023

> At least on x86, acq/rel load/stores vs relaxed is basically free.

The corollary to this is that using them doesn't buy you any performance (under x86).

(Also it makes it difficult to test that your code itself synchronizes correctly if you're developing on x86, since bugs may only show up under ARM or RISC or whatever, and even there, only some of the time... and that's why Loom, TSan, and Miri exist.)

klabb3 · on June 15, 2023

> Seq/cst are relatively fast, but at around 20-30 clock cycles

Did you mean to say seq/cst store?

Also, what operation is “set the value to X unconditionally and return me the previous value”? Is that possible with a store or something different? (Golang calls this op atomic swap)

In either case, sounds like the room for optimizing for performance with granular memory models on x86 is even narrower than I thought.

gpderetta · on June 15, 2023

> Did you mean to say seq/cst store?

Indeed!

> set the value to X unconditionally and return me the previous value

That would be atomic::exchange, which maps to XCHG on x86 and which, like all atomic RMW operations there, is sequentially consistent.

Incidentally seq-cst stores are also typically lowered to XCHG on x86 as opposed to the more obvious MFENCE+MOV.
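
A minimal sketch of that swap in C++, assuming a hypothetical helper named `swap_in` (the name is made up for illustration; `std::atomic<T>::exchange` is the real standard library call):

```cpp
#include <atomic>

// Hypothetical helper: set the value unconditionally and return whatever
// was stored before. std::atomic<T>::exchange is the standard operation.
int swap_in(std::atomic<int>& slot, int next) {
    // Typically lowered to XCHG on x86. Requesting a weaker ordering here
    // is legal C++, but it does not produce a cheaper instruction on x86.
    return slot.exchange(next, std::memory_order_acq_rel);
}
```

For example, with `std::atomic<int> slot{1};`, calling `swap_in(slot, 2)` returns `1` and leaves `2` in `slot`.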

There is still room for optimization: if you can implement your algorithms with just loads and stores, plus as few strategically placed RMWs as possible, it can be a win.

Of course if there is any contention, cache coherence traffic is going to dominate over any atomic cost.
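
As an illustration of the "just loads and stores" idea, here is a sketch of a bounded single-producer/single-consumer queue that uses no RMW at all. `SpscQueue` is a hypothetical name, and the sketch assumes exactly one thread calls `push` and exactly one other thread calls `pop`:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer.
// Assumption: one thread only ever calls push(), another only calls pop().
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2, "one slot stays empty to tell full from empty");

public:
    // Producer side. Returns false if the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);  // own index
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false; // full
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);  // own index
        if (head == tail_.load(std::memory_order_acquire)) return false; // empty
        out = slots_[head];
        head_.store((head + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // written only by the consumer
    std::atomic<std::size_t> tail_{0};  // written only by the producer
};
```

Each index has a single writer, which is why plain acquire loads and release stores are enough here. With more than one producer or consumer you would need an RMW (or a lock) again.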

klabb3 · on June 15, 2023

Thanks a ton!

This matches my own micro-benchmarks in golang ish. I see basically either ~4ns for any write op, including store, swap, add, etc. And ~1ns for loads. I assume it’s all seq-cst.

jcranmer · on June 15, 2023

As noted elsewhere, x86 itself really doesn't have much difference. But it matters a lot more on other architectures--I caused like a 20-30% regression in JS performance on ARM by changing one atomic variable in the engine to sequentially-consistent instead of release-acquire.

My recommendations boil down to the following:

* If you're ever truly unsure, just stick with sequentially-consistent unless performance is so critical you need to get off of it. Correctness is more important than speed!

* You can use acquire/release if you've got something that smells sufficiently like a lock (there's a clear scope with beginning and end, and most of the memory accesses outside the acquire/release themselves are regular, unsynchronized accesses). Most of this code should probably be hidden in libraries anyways, but this probably should be your basic default if you're working with atomics if there's only one atomic variable in play.

* The other memory orderings I wouldn't recommend at all. Release/consume, even were it implemented by compilers as intended, requires a particular (though common) set of circumstances to work correctly. Relaxed affords no synchronization opportunities, and the one use case I can think of for it involves atomic read-modify-write operations, which I think all hardware makes as strong as a release+acquire anyways.

In short, worrying about sequential consistency versus release/acquire can be helpful, and I think there are simple enough rules-of-thumb to make it worthwhile to summarize it. The other memory orderings, not so much.
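
For the "smells like a lock" case, a minimal sketch might look like the following. `SpinLock` and `Counter` are hypothetical names, and this is a bare-bones spinlock for illustration only, not a replacement for `std::mutex`:

```cpp
#include <atomic>

// Hypothetical minimal spinlock: one atomic variable, with a clear scope
// that begins at an acquire and ends at a release.
class SpinLock {
public:
    // Acquire: accesses inside the critical section cannot be reordered
    // before a successful exchange.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // Spin until the current holder stores false.
        }
    }

    // Release: writes made inside the critical section become visible to
    // whichever thread acquires the lock next.
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// The data itself is a regular, unsynchronized variable guarded by the lock.
struct Counter {
    SpinLock lock;
    long value = 0;

    void add(long n) {
        lock.lock();
        value += n;
        lock.unlock();
    }
};
```

Everything between `lock()` and `unlock()` is ordinary code, which is the shape where acquire/release is easy to reason about.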

klabb3 · on June 16, 2023

> I caused like a 20-30% regression in JS performance on ARM by changing one atomic variable in the engine to sequentially-consistent instead of release-acquire.

Very interesting. Was this overall performance? What type of workload was involved?

Yeah it seems like acq-rel is the only other one worth keeping an eye out for. When using atomics you have different logical ops with a certain happens-before relation between them anyway. Figuring out whether these ops map to acq-rel seems like a reasonable task to take on, given the total effort. The main argument against it is lack of testing infrastructure (since indeed correctness is more important). With something like Loom (the Rust project) it’s significantly easier to prevent subtle bugs. I wish it was more widely available.