Problem Description
I am exploring switching to Python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset, and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Recommended Answer
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files, caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
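A minimal sketch of the chunked approach (the csv path and chunk size here are placeholders, and the parsed result is assumed to fit in memory once loaded):

    import pandas as pd

    # Read the csv in pieces instead of one big slurp; each chunk is a DataFrame.
    reader = pd.read_csv("big_file.csv", iterator=True, chunksize=1000)

    # Concatenate the chunks at the end. If even the combined result is too
    # large, process each chunk inside a loop instead of concatenating.
    df = pd.concat(reader, ignore_index=True)

And a rough sketch of the tedious row-by-row transcription into a memory-mapped array, assuming a purely numeric csv with a header row and dimensions known in advance (the shape and file names are made up for illustration):

    import csv
    import numpy as np

    n_rows, n_cols = 200000, 200  # assumed known ahead of time

    # Pre-allocate a memory-mapped array backed by a file on disk.
    data = np.memmap("data.dat", dtype="float64", mode="w+", shape=(n_rows, n_cols))

    with open("big_file.csv", newline="") as f:
        rows = csv.reader(f)
        next(rows)  # skip the header row
        for i, row in enumerate(rows):
            data[i, :] = [float(x) for x in row]

    data.flush()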