class: center, middle # Why Languages Should Preserve Load-Store Order Stephen Dolan 15th January, 2024 --- # Load-buffering This talk (like many) is about variants of this basic program: .col-left[ ``` x := A ... B := x ``` ] .col-right[ ``` y := B ... A := y ``` ] --- # A simple solution: ban it .big[.center[`acyclic(po | rf)`]] ??? Boehm & Dempsky, Outlawing Ghosts, 2014 Vafeiadis & Narayan, Relaxed Separation Logic, 2013 Dang, Jourdan, Kaiser, Deyer, RustBelt Relaxed, 2019 --- # Allowing load-buffering - **Compilers and hardware don't actually make up values** This is a specification difficulty rather than a practical concern. -- - **Only _independent_ load/store pairs are reordered** These are unproblematic, although defining independence in a way that survives a compiler is tricky. -- - **Allowing these reorderings improves performance** Particularly, some ARM hardware does this, for independent operations. --- # Compilers make stuff up sometimes Given a nonatomic global `int A = 0;` .col-left[ ``` A --; ``` ] .col-right[ ``` A |= 1; ``` ] -- - atomic RMWs: `-1`, `0` -- - atomic load/store: `-1`, `0`, `+1` -- - GCC on x86: `-1`, `0`, `+1`, `-255` -- **Compilers evolve to become as weird as the specification permits.** --- # Load-buffering happens with "dependent" operations Given `int* _Atomic A; int* _Atomic B`, both initially non-`NULL`: .col-left[ ``` int* p = load_rlx(&A); if (p != NULL) store_rlx(&B, NULL); ``` ] .col-right[ ``` int* r = load_rlx(&B); store_rlx(&A, r); ``` ] Can `p` be `NULL`? -- - **No**: the only `NULL` arises from the store to `B`, but this does not occur if `p` is `NULL` --- count:false # Load-buffering happens with "dependent" operations Given `int* _Atomic A; int* _Atomic B`, both initially non-`NULL`: .col-left[ ``` int* p = load_rlx(&A); if (p != NULL) store_rlx(&B, NULL); return *p; ``` ] .col-right[ ``` int* r = load_rlx(&B); store_rlx(&A, r); ``` ] Can `p` be `NULL`? - **No**: the only `NULL` arises from the store to `B`, but this does not occur if `p` is `NULL` -- - **Yes**: the `p != NULL` check is redundant and can be removed, then the store reordered. ??? Lochbihler, Making the Java Memory Model Safe, 2014 Boehm, Out-of-thin-air, revisited, again, 2019 --- # Allowing load-buffering makes performance *worse* .col-left[ ``` *q = r; return *q - r; ``` ] --- count:false # Allowing load-buffering makes performance *worse* .col-left[ ``` int go(int* p) { int r; int* q = foo(&r); r = 2; *p = 1; *q = r; return *q - r; } ``` ] -- .col-right[ ``` mov w8, #2 mov w9, #1 str w8, [x29, #28] str w9, [x19] str w8, [x0] ldr w9, [x29, #28] sub w0, w8, w9 ``` ] -- Clang assumes that `*p` and `r` do not alias. --- count:false # Allowing load-buffering makes performance *worse* .col-left[ ``` int go(int* p) { int r; int* q = foo(&r); r = 2; *p = 1; *q = r; return *q - r; } ``` ] .col-right[ ``` void other_thread() { int* x = load_rlx(&B); store_rlx(&A, x); } int* foo(int* r) { store_rlx(&B, r); return &global; } go(load_rlx(&A)); ``` ] Clang assumes that `*p` and `r` do not alias. **But load-buffering means they might!** --- count:false # Allowing load-buffering makes performance *worse* - Load-buffering means that fresh locations may not be fresh - Standard compiler optimisations are *unsound* in the presence of load-buffering - Soundly allowing load buffering would inhibit optimisations, even of sequential code --- # The cost of preventing load-buffering Between each relaxed atomic load and subsequent store, we need a fake dependency. -- Overhead for concurrent data structure benchmarks [Ou & Dempsky 2018]: .big[.center[ average: -0.3% worst: 6.3% ]] ??? Ou & Dempsky, Towards understanding the costs of avoiding out-of-thin-air results, 2018 --- # The benefits of preventing load-buffering - Sound sequential optimisations - Inductive reasoning on values - Existing program logics --- # Questions? .big[.center[`acyclic(po | rf)`]] .center[is a great axiom, and deserves a place in your next memory model.]