Question
I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
```cpp
#include <atomic>
std::atomic<int> y(0);
void f() {
  auto order = std::memory_order_relaxed;
  y.store(1, order);
  y.store(1, order);
  y.store(1, order);
}
```
Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?
Here's the code on Compiler Explorer.
Answer
The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
```cpp
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
```
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data-race bug, but only the garden-variety kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables.) A program that expects to sometimes see the intermediate value is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and the y=3.
It doesn't depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile to zero asm instructions.
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
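A minimal sketch of the progress-bar pattern (the `do_work` function and `progress` variable are my hypothetical names): a legal transformation could delete the store inside the loop and emit only the final one after it.

```cpp
#include <atomic>

std::atomic<int> progress(0);

void do_work(int units) {
    for (int i = 1; i <= units; ++i) {
        // ... one unit of real work here ...
        progress.store(i, std::memory_order_relaxed);
    }
    // The standard would allow a compiler to drop the store in the loop
    // and emit only progress.store(units) here, so a UI thread polling
    // `progress` would see 0 and then jump straight to `units`.
}
```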
There's no C++11 std::atomic way to stop compilers from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect an atomic store to actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. coalescing would violate the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref-count inc/dec in a loop.
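A sketch of that case (the `sum_n_times` function is a hypothetical example of mine): each by-value copy of a shared_ptr does an atomic ref-count increment and each destruction an atomic decrement, so today this loop must emit 2*n atomic RMWs that a coalescing compiler could in principle remove.

```cpp
#include <memory>

int sum_n_times(const std::shared_ptr<int>& p, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        std::shared_ptr<int> copy = p; // atomic increment of the control block
        total += *copy;
    }                                  // atomic decrement when `copy` dies
    return total;
}
```

Passing `p` by const reference throughout (or hoisting one copy out of the loop by hand) is how programmers avoid this today, since the compiler won't.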
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
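A sketch of that point (the `other_data` variable is my hypothetical addition): even if a compiler could cancel the pair's net effect on `num`, each default (seq_cst) atomic RMW also orders the surrounding code, and that barrier effect would still have to be preserved.

```cpp
#include <atomic>

std::atomic<int> num(0);
int other_data = 0;

void f() {
    other_data = 1; // may not be reordered past the RMWs below
    num++;          // seq_cst atomic RMW: acts as a full barrier
    num--;          // net effect on num is zero, but the barriers remain
}
```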
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
- http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
- http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different.) See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly, and it might look silly in a few years if/when C++ decides on different syntax for controlling optimization so that compilers can start doing it in practice.
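A sketch of the volatile-atomic workaround for the progress-bar case (the `publish` function and `progress` variable are my hypothetical names). std::atomic's member functions have volatile-qualified overloads, so this compiles as-is:

```cpp
#include <atomic>

volatile std::atomic<int> progress(0);

void publish(int units) {
    for (int i = 1; i <= units; ++i) {
        // Each of these stores must actually happen: accesses to volatile
        // objects are observable behaviour and can't be optimized away.
        progress.store(i, std::memory_order_relaxed);
    }
}
```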
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
```cpp
if (x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0); // release a lock before a long-running loop
    for() {...} // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.
```
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store of the same value. (Which would be after the long loop in the else branch.) Especially if the store is only relaxed or release instead of seq_cst.
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).