Question
I have a 50 GB SAS dataset that I want to read into a pandas DataFrame. What is the fastest way to read the SAS dataset?
I used the code below, which is way too slow:
import pandas as pd

df = pd.read_sas("xxxx.sas7bdat", chunksize=10000000)
dfs = []
for chunk in df:
    dfs.append(chunk)
df_final = pd.concat(dfs)
Is there a faster way to read a large dataset in Python? Can this process be run in parallel?
Answer
I know this is a very late response, but I think my answer will be useful for future readers. A few months back, when I had to read and process SAS data in either SAS7BDAT or xpt format, I looked at the different libraries and packages available for reading these datasets and shortlisted the following:
1. pandas (high on the priority list due to community support and performance)
2. SAS7BDAT (able to read SAS7BDAT files only; last release July 2019)
3. pyreadstat (promising performance per the documentation, plus the ability to read metadata)
Before picking a package, I did some performance benchmarking. Although I didn't have benchmark results at the time of posting this answer, I found pyreadstat to be faster than pandas (it seems to use multiprocessing while reading the data, as mentioned in the documentation, but I'm not exactly sure). Memory consumption and footprint were also much smaller with pyreadstat than with pandas. In addition, pyreadstat can read the metadata, and even allows reading the metadata only, so I finally ended up picking pyreadstat.
The data read using pyreadstat is already in the form of a DataFrame, so it doesn't need manual conversion to a pandas DataFrame.
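As a minimal sketch of the basic calls mentioned above (the file name foo.sas7bdat is a placeholder), note that pyreadstat returns both the DataFrame and a metadata object, supports a metadata-only read, and in recent releases ships an explicit multiprocessing helper:

import pyreadstat

# A normal read returns a pandas DataFrame plus a metadata object
df, meta = pyreadstat.read_sas7bdat("foo.sas7bdat")

# Reading the metadata only (no rows) is cheap, even for a huge file
_, meta = pyreadstat.read_sas7bdat("foo.sas7bdat", metadataonly=True)
print(meta.number_rows, meta.column_names)

# Explicit multiprocessing reader, which can speed things up on multi-core machines
df, meta = pyreadstat.read_file_multiprocessing(
    pyreadstat.read_sas7bdat, "foo.sas7bdat", num_processes=4
)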
Talking about reading large SAS data, pyreadstat has row_limit and row_offset parameters that can be used to read in chunks, so memory is not going to be a bottleneck. Furthermore, while reading the SAS data in chunks, you can convert each chunk to categorical and append it to the resulting data before reading the next chunk; this compresses the data, so memory consumption is extremely low (it depends on the data: the fewer unique values in the DataFrame, the lower the memory usage). The following code snippet might be useful for anyone who needs to read large SAS data:
import pandas as pd
import pyreadstat

filename = 'foo.SAS7BDAT'
CHUNKSIZE = 50000
offset = 0

# read the first chunk; for xpt data, use pyreadstat.read_xpt()
allChunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
allChunk = allChunk.astype('category')

while True:
    offset += CHUNKSIZE
    chunk, _ = pyreadstat.read_sas7bdat(filename, row_limit=CHUNKSIZE, row_offset=offset)
    if chunk.empty:
        break  # an empty chunk means the entire file has been read
    chunk = chunk.astype('category')  # union_categoricals needs categorical inputs
    for eachCol in chunk:  # align the categories of each column across chunks
        colUnion = pd.api.types.union_categoricals([allChunk[eachCol], chunk[eachCol]])
        allChunk[eachCol] = pd.Categorical(allChunk[eachCol], categories=colUnion.categories)
        chunk[eachCol] = pd.Categorical(chunk[eachCol], categories=colUnion.categories)
    allChunk = pd.concat([allChunk, chunk])  # append each chunk to the resulting dataframe
PS: Please note that the resulting DataFrame allChunk will have all columns as Categorical data.
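If downstream code expects plain dtypes rather than Categorical columns, one way to convert back, as a sketch, is below; which columns should become numeric again depends on your data, so the numeric_cols list is a hypothetical placeholder:

# convert all categorical columns back to ordinary object dtype
for col in allChunk.columns:
    allChunk[col] = allChunk[col].astype('object')

# restore originally numeric columns (placeholder names, adjust to your data)
numeric_cols = ['AGE', 'WEIGHT']
for col in numeric_cols:
    allChunk[col] = pd.to_numeric(allChunk[col], errors='coerce')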
Here are some benchmarks (time to read the file into a DataFrame) performed on real CDISC data (raw and standardized); the file sizes range from a few KB to a few MB and include both the xpt and sas7bdat file formats:
Reading ADAE.xpt 49.06 KB for 100 loops:
Pandas Average time : 0.02232 seconds
Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
Pandas Average time : 0.01610 seconds
Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
Pandas Average time : 0.03248 seconds
Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
Pandas Average time : 5.30275 seconds
Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
Pandas Average time : 15.53942 seconds
Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04017 seconds
Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01052 seconds
Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04446 seconds
Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01273 seconds
Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
Pandas Average time : 7.93266 seconds
Pyreadstat Average time : 0.92295 seconds
As you can see, for xpt files the read times aren't better, but for sas7bdat files pyreadstat simply outperforms pandas.
The above benchmarks were performed with pyreadstat 1.0.9, pandas 1.2.4, and Python 3.7.5.