Problem description
Let us define :
from multiprocessing import Pool
import numpy as np
def func(x):
for i in range(1000):
i**2
return 1
Notice that func()
does something and it always returns a small number 1
.
Then, I compare an 8-core parallel Pool.map()
versus a serial, Python built-in map()
n=10**3
a=np.random.random(n).tolist()
with Pool(8) as p:
%timeit -r1 -n2 p.map(func,a)
%timeit -r1 -n2 list(map(func,a))
This gives :
38.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
200 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
which shows quite good parallel scaling. Because I use 8 cores, and 38.4 [ms]
is roughly 1/8 of 200 [ms]
Then let us try Pool.map()
on a list of some bigger things; for simplicity, I use a list-of-lists this way :
n=10**3
m=10**4
a=np.random.random((n,m)).tolist()
with Pool(8) as p:
%timeit -r1 -n2 p.map(func,a)
%timeit -r1 -n2 list(map(func,a))
which gives :
292 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
209 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
You see, the parallel scaling is gone! 292 ms vs. 209 ms - the 8-core Pool.map() is now even slower than the serial map().
We can make it much worse; try making each sub-list passed even bigger :
n=10**3
m=10**5
a=np.random.random((n,m)).tolist()
with Pool(8) as p:
%timeit -r1 -n2 p.map(func,a)
%timeit -r1 -n2 list(map(func,a))
This gives :
3.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
179 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
Wow, with even larger sub-lists, the timing result is totally reversed. We use 8 cores only to get a roughly 20-times slower timing!!
You can also notice that the serial map()
's timing has nothing to do with the sub-list size. So a reasonable explanation would be that Pool.map()
is really passing the content of those big sub-lists around between processes, which causes additional copying?
I am not sure. But if so, why doesn't it pass the address of the sub-list? After all, the sub-list is already in memory, and in practice the func()
I use is guaranteed not to change/modify the sub-list.
So, in python, what is the correct way to keep parallel scaling when mapping some operations on a list of large things?
Before we start
and dive deeper into any hunt for nanoseconds ( and right, it will soon start, as each [ns]
matters once the scaling opens the whole Pandora's box of problems ), let's agree on the scales - the easiest and often "cheap" premature tricks may, and often will, derail your dreams once the problem size has grown into realistic scales - the thousands ( seen above in both iterators ) behave way differently for in-cache computing with < 0.5 [ns]
data-fetches than once having grown beyond the L1/L2/L3-cache sizes, for scales above 1E+5, 1E+6, 1E+9,
above [GB]
s, where each mis-aligned fetch is WAY more EXPENSIVE than a few 100 [ns]
Q : "... because I have 8 cores, I want to use them to get 8 times faster"
I wish you could, indeed. Yet, sorry for telling the truth straight, the World does not work this way.
See this interactive tool: it will show you both the speedup limits and their principal dependence on the actual production costs of real-world scaling of the initial problem as it grows from trivial sizes, and these combined effects at scale. Just click it and play with the sliders to see it live, in action :
Q : (is)
Pool.map()
really passing the content of those big sub-lists around between processes, which causes additional copying?
Yes,
it must do so, by design
plus it does that by passing all that data "through" another "expensive" SER/DES processing,
so as to make it get delivered "there".
The very same would apply vice-versa whenever you would have tried to return "back" some mastodon-sized result(s), which you did not, here above.
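As a quick way to see this cost yourself, here is a minimal sketch ( illustrative only; the sizes are made-up assumptions and the timings will differ per machine ) measuring just the pickle SER/DES step that every Pool.map() argument has to go through:
import pickle
import time
import numpy as np
sub_list = np.random.random( 10**5 ).tolist()   # one "large thing", sized as in the question
t0 = time.perf_counter()
blob = pickle.dumps( sub_list )                 # SER step, paid in the parent for every argument sent
t1 = time.perf_counter()
_ = pickle.loads( blob )                        # DES step, paid again inside the worker process
t2 = time.perf_counter()
print( f"SER {t1-t0:.4f} s, DES {t2-t1:.4f} s, payload ~{len( blob )/1e6:.1f} MB" )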
Q : But if so, why doesn't it pass the address of the sub-list?
Because the remote ( parameter-receiving ) process is another, fully autonomous process, with its own, separate and protected address-space, we cannot just pass an address-reference "into" it - and we wanted it to be a fully independent, autonomously working python process ( due to a will to use this trick so as to escape from the GIL-lock dancing ), didn't we? Sure we did - this is a central step of our escape from the GIL-Wars ( for a better understanding of the GIL-lock pros and cons, you may like this and this ( Pg.15+ on CPU-bound processing ) ).
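A minimal sketch ( illustrative only, not part of the original answer ) that demonstrates this separate address-space: a change made inside the spawned worker never shows up in the parent's copy:
from multiprocessing import Pool

def mutate( a_list ):
    a_list[0] = -1.0            # mutates only the worker's private, deserialised copy
    return a_list[0]

if __name__ == "__main__":
    data = [ 0.0 ] * 5
    with Pool( 1 ) as p:
        seen_in_worker = p.apply( mutate, ( data, ) )
    print( seen_in_worker )     # -1.0 ... the worker saw its own change
    print( data[0] )            #  0.0 ... the parent's list stays untouched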
0.1 ns - NOP
0.3 ns - XOR, ADD, SUB
0.5 ns - CPU L1 dCACHE reference (1st introduced in late 80-ies )
0.9 ns - JMP SHORT
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1 ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
3~4 ns - CPU L2 CACHE reference (2020/Q1)
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
10 ns - DIV
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with a Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
?~~~ 2,500,000 ns - Read 10 MB sequentially from MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
?~~ 25,000,000 ns - Read 100 MB sequentially from MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
30,000,000 ns - Read 1 MB sequentially from a DISK
?~~ 36,000,000 ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
Q : " what is the correct way to keep parallel scaling when parallel mapping some operations on a list of large things? "
A )
UNDERSTAND THE WAYS TO AVOID OR AT LEAST REDUCE EXPENSES :
Understand all the types of the costs you have to pay and will pay :
spend as low process-instantiation costs as possible ( they are rather expensive ), best as a one-time cost only
On macOS, the
spawn
start method is now the default. Thefork
start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.
spend as small an amount of parameter-passing costs as you must ( yes, best avoid repetitively passing those "large things" as parameters - see the initializer sketch right after this list )
- never waste resources on things that do not perform your job - ( never spawn more processes than reported by
len( os.sched_getaffinity( 0 ) )
- any process above this count will just wait for its next CPU-core slot and will just evict another, cache-efficient process, thus re-paying all the fetch-costs already paid once, so as to re-fetch all the data and camp them back in-cache, only for them to be evicted again soon, while the processes that had been working cache-efficiently so far were evicted ( for what good? ) by a naive use of as many as multiprocessing.cpu_count()
-reported processes, so expensively spawned in the initial Pool
-creation )
-creation ) - better re-use a pre-allocated memory, than keep spending ad-hoc memory allocation costs ALAP
- never share a bit, if The Performance is the goal
- never block, never - be it python
gc
, which may block if not avoided, or Pool.map()
, which blocks as well
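A minimal sketch of the "pay it only once" idea mentioned above ( the names _init_worker, _SHARED and work_on_row are illustrative assumptions, not part of the original answer ): the large, read-only data is handed to each worker exactly once, at Pool instantiation, instead of being re-pickled for every mapped item:
import os
import numpy as np
from multiprocessing import Pool

_SHARED = None                                     # each worker's private, read-only copy

def _init_worker( big_data ):
    global _SHARED
    _SHARED = big_data                             # transfer cost paid once, at spawn time

def work_on_row( row_index ):
    row = _SHARED[ row_index ]                     # no per-call transfer of the big data
    return sum( x * x for x in row )

if __name__ == "__main__":
    big_data = np.random.random( ( 10**3, 10**4 ) ).tolist()
    n_procs = len( os.sched_getaffinity( 0 ) )     # never more processes than this
    with Pool( n_procs, initializer = _init_worker, initargs = ( big_data, ) ) as p:
        results = p.map( work_on_row, range( len( big_data ) ) )
    print( len( results ) )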
B )
UNDERSTAND THE WAYS TO INCREASE THE EFFICIENCY :
Understand all the efficiency-increasing tricks, even at a cost of code complexity ( a few SLOC-s are easy to show in school-books, yet they sacrifice both the efficiency and the performance - even though these two are exactly what you are fighting for, for a sustainable performance throughout the scaling ( of either the problem size or the iteration depths, or when growing both of them at the same time ) ).
Some categories of the real-world costs from A ) have dramatically changed the limits of the theoretically achievable speedups to be expected from going into some form of [PARALLEL]
process orchestration ( here, making some parts of the code-execution get executed in the spawned sub-processes ). The initial view of this was first formulated by Dr. Gene Amdahl as early as 60+ years ago, to which two principal extensions were recently added: the process instantiation(s)-related setup + termination add-on costs ( extremely important in py2 always & in py3.5+ for MacOS and Windows ) and the atomicity-of-work
, which will be discussed below.
Overhead-strict re-formulation of the Amdahl's Law speedup S:
S = speedup which can be achieved with N processors
s = a proportion of a calculation, which is [SERIAL]
1-s = a parallelizable portion, that may run [PAR]
N = a number of processors ( CPU-cores ) actively participating on [PAR] processing
               1
S =  __________________________; where s, ( 1 - s ), N were defined above
                ( 1 - s )            pSO:= [PAR]-Setup-Overhead     add-on cost/latency
     s  + pSO + _________ + pTO      pTO:= [PAR]-Terminate-Overhead add-on cost/latency
                    N
Overhead-strict and resources-aware re-formulation:
                           1                          where s, ( 1 - s ), N
S =  ______________________________________________ ;       pSO, pTO
                   |  ( 1 - s )            |                 were defined above
     s  + pSO + max|  _________ , atomicP  | + pTO     atomicP:= a unit of work,
                   |      N                |                     further indivisible,
                                                                  a duration of an
                                                                  atomic-process-block
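For a quick feel of what these overhead terms do to the achievable speedup, a small sketch ( the numeric values plugged in are made-up illustrations, not measurements ) evaluating the overhead-strict, resources-aware formulation above:
def amdahl_overhead_strict( s, N, pSO, pTO, atomicP = 0.0 ):
    # s       : [SERIAL] proportion of the calculation ( 0 <= s <= 1 )
    # N       : number of CPU-cores actively participating in [PAR] processing
    # pSO/pTO : setup / termination overheads, as fractions of the total work
    # atomicP : duration of the further-indivisible work unit, same units as ( 1 - s ) / N
    return 1.0 / ( s + pSO + max( ( 1.0 - s ) / N, atomicP ) + pTO )

# made-up example: 5 % serial part, 8 cores, 10 % combined spawn + teardown overhead
print( amdahl_overhead_strict( s = 0.05, N = 8, pSO = 0.08, pTO = 0.02 ) )   # ~ 3.7x, far from 8x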
Prototype on target CPU/RAM device with your python, scaled >>1E+6
Any simplified mock-up example will somehow skew your expectations about how the actual workloads will perform in-vivo. Underestimated RAM-allocations, not seen at small scales, may later surprise at scale, sometimes even throwing the operating system into sluggish states, swapping and thrashing. Some smarter tools ( numba.jit()
) may even analyze the code and short-cut some passages of code that will never be visited or that do not produce any result, so be warned that simplified examples may lead to surprising observations.
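As an illustration of that warning, a hypothetical sketch ( assumes numba is installed; whether the loop really gets eliminated depends on the numba/LLVM version ): a loop whose result is discarded may be optimised away once JIT-compiled, so the "benchmark" then measures next to nothing:
import time
import numba

@numba.njit
def busy_loop( n ):
    acc = 0.0
    for i in range( n ):
        acc += i * i            # result is returned, so the loop cannot be removed
    return acc

@numba.njit
def dead_loop( n ):
    for i in range( n ):
        i * i                   # result is discarded - LLVM may eliminate the whole loop
    return 1

busy_loop( 10 ); dead_loop( 10 )          # trigger the JIT compilation outside the timing

t0 = time.perf_counter(); busy_loop( 10**8 ); t1 = time.perf_counter()
t2 = time.perf_counter(); dead_loop( 10**8 ); t3 = time.perf_counter()
print( f"busy: {t1-t0:.4f} s   dead: {t3-t2:.4f} s" )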
from multiprocessing import Pool
import numpy as np
import os
SCALE = int( 1E9 )
STEP = int( 1E1 )
aLIST = np.random.random( ( 10**3, 10**4 ) ).tolist()
#######################################################################################
# func() does some SCALE'd amount of work, yet
# passes almost zero bytes as parameters
# allocates nothing, but iterator
# returns one byte,
# invariant to any expensive inputs
def func( x ):
for i in range( SCALE ):
i**2
return 1
A few hints on making the scaling strategy less expensive in overhead costs :
#####################################################################################
# more_work_en_block() wraps some SCALE'd amount of work, sub-list specified
def more_work_en_block( en_block = [ None, ] ):
return [ func( nth_item ) for nth_item in en_block ]
If one indeed must pass a big list, better pass a larger block and remote-iterate over its parts ( instead of paying the transfer-costs for each and every item passed, many many more times than when using sub_blocks
): the parameters get SER/DES processed ( ~ the costs of pickle.dumps()
+ pickle.loads()
) [per-each-call], again at an add-on cost, which decreases the resulting efficiency and worsens the overheads part of the extended, overhead-strict Amdahl's Law.
#####################################################################################
# some_work_en_block() wraps some SCALE'd amount of work, tuple-specified
def some_work_en_block( sub_block = ( [ None, ], 0, 1 ) ):
return more_work_en_block( en_block = sub_block[0][sub_block[1]:sub_block[2]] )
Right-sizing the number of process-instances :
aMaxNumOfProcessesThatMakesSenseToSPAWN = len( os.sched_getaffinity( 0 ) ) # never more
with Pool( aMaxNumOfProcessesThatMakesSenseToSPAWN ) as p:
     p.imap_unordered( some_work_en_block, [ ( aLIST,
start,
start + STEP
)
for start in range( 0, len( aLIST ), STEP ) ] )
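One caveat worth adding ( a hedged usage sketch, not part of the original snippet ): imap_unordered() hands back a lazy iterator, so the results ought to be drained before the with-block closes and terminates the Pool, e.g. :
with Pool( aMaxNumOfProcessesThatMakesSenseToSPAWN ) as p:
    for partial_result in p.imap_unordered( some_work_en_block,
                                            [ ( aLIST, start, start + STEP )
                                              for start in range( 0, len( aLIST ), STEP ) ] ):
        pass        # consume ( or aggregate ) each chunk's result as it arrives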
Last but not least, expect immense performance boosts from a smart use of numpy
-vectorised code, best without repetitive passing of static, pre-copied BLOBs ( copied during the process instantiation(s), thus paid as a reasonably scaled and here unavoidable cost ), used in the code as read-only data in a vectorised ( CPU-very-efficient ) fashion, without passing the same data again via parameter-passing. Some examples of how one can get a ~ +500 x
speedup may be read here or here, about a ~ +400 x
speedup, or about a case of just about a ~ +100 x
speedup, with some examples of problem-isolation testing scenarios.
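A minimal illustration of that last point ( a sketch only; the 10**4-element row size is borrowed from the question and the actual speedup factor will differ per machine ): the same per-row arithmetic done as a vectorised numpy operation instead of a Python-level loop:
import time
import numpy as np

row = np.random.random( 10**4 )

t0 = time.perf_counter()
loop_result = sum( x * x for x in row.tolist() )    # Python-level loop over the row
t1 = time.perf_counter()
vec_result = np.dot( row, row )                     # vectorised, runs in optimised C
t2 = time.perf_counter()

print( f"loop {t1-t0:.6f} s   vectorised {t2-t1:.6f} s   same result: {np.isclose( loop_result, vec_result )}" )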
Anyway, the closer the mock-up code is to your actual workloads, the more sense the benchmarks will make ( at scale & in production ).
Good luck on exploring the World, as it is,
not as a dream if it were different,
not as a wish it were different or that we would like it to be
:o)
Facts and Science matter - both + together
Records of Evidence are the core steps forwards to achieve as high performance as possible,
not any Product Marketing,
not any Evangelisation Clans wars,
not any Blog-posts' chatter
At least don't say you were not warned
:o)