問題描述
我有一個大小為 4 GB 的 XML 文件.我想解析它并將其轉換為數據框來處理它.但由于文件太大,以下代碼無法將文件轉換為 Pandas 數據框.代碼只是不斷加載,不提供任何輸出.但是當我將它用于較小尺寸的類似文件時,我得到了正確的輸出.
I have an XML file of size 4 GB. I want to parse it and convert it to a Data Frame to work on it. But because the file size is too large the following code is unable to convert the file to a Pandas Data Frame. The code just keeps loading and does not provide any output. But when I use it for a similar file of smaller size I obtain the correct output.
任何人都可以提出任何解決方案.也許是一個代碼可以加快從 XML 到 Data Frame 的轉換過程或將 XML 文件拆分為更小的子集.
Can anyone suggest any solution to this. Maybe a code that speeds up the process of conversion from XML to Data Frame or splitting of the XML file into smaller sub sets.
任何建議我是否應該在我的個人系統(2 GB RAM)上處理如此大的 XML 文件,或者我應該使用 Google Colab.如果是 Google Colab,那么有什么方法可以更快地將如此大的文件上傳到驅動器,從而更快地上傳到 Colab?
Any suggestion whether I should work with such large XML files on my personal system (2 GB RAM) or I should use Google Colab. If Google Colab, then is there any way to upload such large files quicker to drive and thus to Colab?
以下是我使用的代碼:
import xml.etree.ElementTree as ET
tree = ET.parse("Badges.xml")
root = tree.getroot()
#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']
#Creating DataFrame
df = pd.DataFrame(columns = columns)
#Converting XML Tree to a Pandas DataFrame
for node in root:
row_Id = node.attrib.get("Id")
UserId = node.attrib.get("UserId")
Name = node.attrib.get("Name")
Date = node.attrib.get("Date")
Class = node.attrib.get("Class")
TagBased = node.attrib.get("TagBased")
df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)
以下是我的 XML 文件:
Following is my XML File:
<badges>
<row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
<row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
<row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
推薦答案
考慮 iterparse
用于以增量方式構建樹的快速流處理.在每次迭代中構建一個字典列表,然后您可以將其傳遞到 pandas.DataFrame
構造函數 once 外部循環.下面調整根子節點的重復節點名稱:
Consider iterparse
for fast streaming processing that builds tree incrementally. In each iteration build a list of dictionaries that you can then pass into pandas.DataFrame
constructor once outside loop. Adjust below to name of repeating nodes of root's children:
from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd
file_path = r"/path/to/Input.xml"
dict_list = []
for _, elem in iterparse(file_path, events=("end",)):
if elem.tag == "row":
dict_list.append({'rowId': elem.attrib['Id'],
'UserId': elem.attrib['UserId'],
'Name': elem.attrib['Name'],
'Date': elem.attrib['Date'],
'Class': elem.attrib['Class'],
'TagBased': elem.attrib['TagBased']})
# dict_list.append(elem.attrib) # ALTERNATIVELY, PARSE ALL ATTRIBUTES
elem.clear()
df = pd.DataFrame(dict_list)
這篇關于Python 中的大型 XML 文件解析的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!