Problem description
I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.
Here is an abstraction of what I am trying to do:
from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [ARGS ARRAY GOES HERE]
output = poolVar.map(myFunction, argsArray)
The function f(x) is contained in a *.so file, i.e., it calls a C function.
The problem I am having is that the value of the output variable is different each time I run my program (even though the function myObject.f() is a deterministic function). (If I only have one process then the output variable is the same each time I run the program.)
I have tried creating the object inside the function rather than storing it as a global variable:
def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)
However, in my program the object creation is expensive, so it is a lot easier to create myObject once and then modify it each time I call myFunction2(). I would therefore like to avoid creating the object each time.
Do you have any tips? I am very new to parallel programming so I could be going about this all wrong. I decided to use the Pool class since I wanted to start with something simple, but I am willing to try a better way of doing it.
Recommended answer
"I am using the Pool class from python's multiprocessing library to do some shared memory processing on an HPC cluster."
Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory, which means that global variables are copied into each worker, so their values in the original process do not change.
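A minimal sketch of this point, assuming a Unix-like system (it asks for the fork start method explicitly so that it behaves the same across Python versions). The names counter, bump and run_demo are made up for illustration. Each worker increments its own copy of the global, so the parent's copy never changes:

```python
import multiprocessing as mp

counter = 0  # module-level global; every worker process gets its own copy


def bump(_):
    """Increment this process's copy of the global and return the new value."""
    global counter
    counter += 1
    return counter


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    with ctx.Pool(2) as pool:
        seen_in_workers = pool.map(bump, range(8))
    # The increments happened in the workers; the parent's global is untouched.
    return seen_in_workers, counter


if __name__ == "__main__":
    seen, parent_value = run_demo()
    print("values seen inside workers:", seen)  # can vary from run to run
    print("value in the parent:", parent_value)
```

The values returned by the workers depend on which worker happened to pick up which task. That is a likely explanation for results that change between runs when a worker-side object accumulates state across calls.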
If you want to use shared memory between processes then you must use multiprocessing's data types, such as Value and Array, or use a Manager to create shared lists etc.
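As a rough illustration of the shared data types, the sketch below has four processes update one shared integer. It assumes a Unix-like system (fork start method), and add_many and run_demo are hypothetical names:

```python
import multiprocessing as mp


def add_many(total, n):
    """Add 1 to the shared integer n times, holding its lock for each update."""
    for _ in range(n):
        with total.get_lock():
            total.value += 1


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    total = ctx.Value("i", 0)     # shared C int, initial value 0
    workers = [ctx.Process(target=add_many, args=(total, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total.value


if __name__ == "__main__":
    print(run_demo())  # 4000
```

Note that a Value or Array has to be handed to the child when it is created (as a Process argument, or through a Pool's initargs). It cannot be sent as an ordinary task argument through Pool.map.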
In particular you might be interested in the Manager.register method (the register classmethod of multiprocessing.managers.BaseManager), which allows a manager to create shared custom objects (although they must be picklable).
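A sketch of how a registered custom object might be shared, assuming a Unix-like system (fork start method). Tally, TallyManager, work and run_demo are hypothetical names, not part of the library:

```python
import multiprocessing as mp
import threading
from multiprocessing.managers import BaseManager


class Tally:
    """Hypothetical custom object; it lives in the manager's server process."""

    def __init__(self):
        # The server may handle several clients concurrently, so guard the state.
        self._lock = threading.Lock()
        self._count = 0

    def add(self, n):
        with self._lock:
            self._count += n

    def get(self):
        return self._count


class TallyManager(BaseManager):
    pass


# register() is a classmethod: it adds a TallyManager.Tally() factory
# that creates the object in the server and returns a proxy to it.
TallyManager.register("Tally", Tally)


def work(tally, n):
    tally.add(n)  # the call and its argument are pickled and sent to the server


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    with TallyManager(ctx=ctx) as manager:
        tally = manager.Tally()
        workers = [ctx.Process(target=work, args=(tally, n)) for n in (1, 2, 3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return tally.get()


if __name__ == "__main__":
    print(run_demo())  # 6
```

Every method call on the proxy is a round trip to the server process, so this suits small, infrequent updates rather than heavy per-task computation.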
However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.
Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.
For example, in its simplest form, to create a global variable in the worker process:
def initializer():
    global data
    data = createObject()
Used as:
pool = Pool(4, initializer, ())
Then the worker functions can use the data global variable without worries.
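Putting the pieces together, here is a self-contained sketch of the initializer pattern. ExpensiveObject is a hypothetical stand-in for the costly object, and its modify(x) signature is an assumption made for the example: it overwrites the state from the input rather than accumulating it, on the assumption that the result should depend only on x. The fork start method is requested explicitly, which assumes a Unix-like system.

```python
import multiprocessing as mp

data = None  # set once per worker process by init_worker()


class ExpensiveObject:
    """Hypothetical stand-in for an object that is costly to construct."""

    def __init__(self, scale):
        self.scale = scale
        self.offset = 0

    def modify(self, x):
        # Overwrite the state from the input instead of accumulating it,
        # so the result does not depend on which tasks this worker ran before.
        self.offset = x % 3

    def f(self, x):
        return self.scale * x + self.offset


def init_worker(scale):
    """Runs once in each worker process when the pool starts."""
    global data
    data = ExpensiveObject(scale)


def my_function(x):
    data.modify(x)
    return data.f(x)


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    with ctx.Pool(4, initializer=init_worker, initargs=(10,)) as pool:
        return pool.map(my_function, range(6))


if __name__ == "__main__":
    print(run_demo())  # [0, 11, 22, 30, 41, 52]
```

The object is built four times in total (once per worker) instead of once per task, and no object state crosses process boundaries.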
Style note: Never use the name of a built-in for your variables/modules. In your case, object is a built-in. Otherwise you'll end up with unexpected errors which may be obscure and hard to track down.
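A small illustration of the kind of error shadowing can cause, using list as the shadowed built-in (shadowing_demo is a made-up name):

```python
def shadowing_demo(values):
    list = sorted(values)   # 'list' now names a plain list object in this scope
    return list(range(3))   # raises TypeError: 'list' object is not callable
```

The failure shows up where the built-in is used, not where the name was reassigned, which is what makes such bugs hard to trace.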