
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size

Problem Description


C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. First, I thought it is just a portable way to get the size of a L1 cache line but that is an oversimplification.

Questions:

  • How are these constants related to the L1 cache line size?
  • Is there a good example that demonstrates their use cases?
  • Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?

Solution

The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

I'll quote a snippet of the rationale here for ease-of-reading:

[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.

Uses of cache-line size fall into two broad categories:

  • Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
  • Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.

The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

  • Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
  • Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.

In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
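As a concrete illustration of the alignas() use the proposal describes, here is a minimal sketch of padding per-thread counters so concurrent writers don't falsely share a line. The 64-byte fallback and the struct name are assumptions for illustration, covering toolchains whose standard library doesn't ship the constant yet:

```cpp
#include <cassert>
#include <cstddef>
#include <new>  // where C++17 places the interference-size constants

// Fall back to 64 bytes (an assumption, typical of current x86-64
// desktops) when the library doesn't provide the constant.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t line_size = std::hardware_destructive_interference_size;
#else
constexpr std::size_t line_size = 64;
#endif

// One counter per writer thread; alignas pads each counter onto its
// own presumed cache line, so writers on different threads don't
// invalidate each other's lines.
struct alignas(line_size) padded_counter {
    unsigned long value = 0;
};

// alignas rounds both alignment and size up to the line size.
static_assert(alignof(padded_counter) == line_size, "");
static_assert(sizeof(padded_counter) == line_size, "");
```

An array of `padded_counter`, indexed by thread id, then gives each thread a private line, which is exactly the shape the benchmark below measures.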


"How are these constants related to the L1 cache line size?"

In theory, pretty directly.

Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)

For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)


"Is there a good example that demonstrates their use cases?"

At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.

It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in an L1 cache line, and in the other a single element takes up an entire L1 cache line. In a tight loop, a fixed element is chosen from the array and updated repeatedly.

It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.

Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.

I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).

Warning: the test will use all cores on your machine, and allocate ~256MB of memory. Don't forget to compile with optimizations!

On my machine, the output is:

Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.


"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"

This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.

This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
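That lower bound can be checked directly at compile time. This guarded sketch compiles even on pre-C++17 library implementations (the feature-test macro is the standard one from <new>):

```cpp
#include <cassert>
#include <cstddef>  // std::max_align_t
#include <new>      // interference-size constants, when available

// The guaranteed lower bound: both constants are at least
// alignof(std::max_align_t). Guarded so the sketch still compiles
// where the constants don't exist yet.
#ifdef __cpp_lib_hardware_interference_size
static_assert(std::hardware_destructive_interference_size >=
                  alignof(std::max_align_t), "");
static_assert(std::hardware_constructive_interference_size >=
                  alignof(std::max_align_t), "");
#endif
```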

In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
  return KNOWN_L1_CACHE_LINE_SIZE;
#else
  return std::hardware_destructive_interference_size;
#endif
}

During compilation, if you want to assume a cache-line size just define KNOWN_L1_CACHE_LINE_SIZE.

Hope this helps!

Benchmark program:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
            ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}
