Problem description
I'm trying to process the contents of a tarfile using multiprocessing.Pool. I'm able to successfully use the ThreadPool implementation within the multiprocessing module, but would like to be able to use processes instead of threads, as it would possibly be faster and eliminate some changes made for Matplotlib to handle the multithreaded environment. I'm getting an error that I suspect is related to processes not sharing address space, but I'm not sure how to fix it:
Traceback (most recent call last):
  File "test_tarfile.py", line 32, in <module>
    test_multiproc()
  File "test_tarfile.py", line 24, in test_multiproc
    pool.map(read_file, files)
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 225, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 522, in get
    raise self._value
ValueError: I/O operation on closed file
The actual program is more complicated, but this is an example of what I'm doing that reproduces the error:
from multiprocessing.pool import ThreadPool, Pool
import StringIO
import tarfile

def write_tar():
    tar = tarfile.open('test.tar', 'w')
    contents = 'line1'
    info = tarfile.TarInfo('file1.txt')
    info.size = len(contents)
    tar.addfile(info, StringIO.StringIO(contents))
    tar.close()

def test_multithread():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = ThreadPool(processes=1)
    pool.map(read_file, files)
    tar.close()

def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = Pool(processes=1)
    pool.map(read_file, files)
    tar.close()

def read_file(f):
    print f.read()

write_tar()
test_multithread()
test_multiproc()
I suspect that something's wrong when the TarInfo object is passed into the other process but the parent TarFile is not, but I'm not sure how to fix it in the multiprocess case. Can I do this without having to extract files from the tarball and write them to disk?
Answer
You're not passing a TarInfo object into the other process; you're passing the result of tar.extractfile(member) into the other process, where member is a TarInfo object. The extractfile(...) method returns a file-like object which has, among other things, a read() method which operates upon the original tar file you opened with tar = tarfile.open('test.tar').
However, you can't use an open file from one process in another process; you have to re-open the file. I replaced your test_multiproc() with this:
def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [name for name in tar.getnames()]
    pool = Pool(processes=1)
    result = pool.map(read_file2, files)
    tar.close()
and added this:
def read_file2(name):
    t2 = tarfile.open('test.tar')
    print t2.extractfile(name).read()
    t2.close()
并且能夠讓您的代碼正常工作.
and was able to get your code working.