Problem description
I have a large XML file (about 84MB) in this form:
<books>
<book>...</book>
....
<book>...</book>
</books>
My goal is to extract every single book and get its properties. I tried to parse it (as I did with other XML files) as follows:
from xml.dom.minidom import parse, parseString
fd = "myfile.xml"
parser = parse(fd)
## other python code here
but the code seems to fail at the parse call. Why is this happening, and how can I solve it?
I should point out that the file may contain Greek, Spanish and Arabic characters.
This is the output I got in IPython:
In [2]: fd = "myfile.xml"
In [3]: parser = parse(fd)
Killed
I would also like to point out that the computer freezes during execution, so this may be related to memory consumption, as stated below.
Recommended answer
I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are caused by minidom requesting too much memory.
Python comes with an XML SAX parser. To use it, do something like the following.
from xml.sax.handler import ContentHandler
from xml.sax import parse

class MyContentHandler(ContentHandler):
    # override various ContentHandler methods as needed...
    pass

handler = MyContentHandler()
parse("mydata.xml", handler)
Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS or characters). These handle events generated by the SAX parser as it reads your XML document in.
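As a rough illustration of those events, here is a minimal sketch of a handler that counts book elements and logs what the parser reports. The class name BookCounter and the inline SAMPLE document are made up for this example; for a real file you would call xml.sax.parse with the file name instead of parseString.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BookCounter(ContentHandler):
    """Hypothetical handler: counts <book> elements and logs events."""

    def __init__(self):
        super().__init__()
        self.book_count = 0
        self.events = []

    def startElement(self, name, attrs):
        # Called once for every opening tag.
        self.events.append(("start", name))
        if name == "book":
            self.book_count += 1

    def endElement(self, name):
        # Called once for every closing tag.
        self.events.append(("end", name))

    def characters(self, content):
        # Called for text between tags; ignore whitespace-only chunks.
        if content.strip():
            self.events.append(("text", content.strip()))


# Tiny stand-in for the real 84MB file (assumed structure).
SAMPLE = b"<books><book>First</book><book>Second</book></books>"

counter = BookCounter()
xml.sax.parseString(SAMPLE, counter)
print(counter.book_count)
```

The handler never sees the document as a whole, only one event at a time, which is why memory use does not grow with file size unless your own code stores everything it sees.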
SAX is a more 'low-level' way to handle XML than DOM; in addition to pulling out the relevant data from the document, your ContentHandler will need to keep track of which elements it is currently inside. On the upside, however, as SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.
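One way to do that bookkeeping is to keep a stack of open element names. The sketch below assumes each book has simple child elements such as title and author; those names, the BookHandler class and the on_book callback are assumptions for illustration, since the question does not show what is inside a book.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BookHandler(ContentHandler):
    """Hypothetical handler: builds one dict per <book> and hands it off."""

    def __init__(self, on_book):
        super().__init__()
        self.on_book = on_book  # callback invoked once per completed <book>
        self.stack = []         # names of the currently open elements
        self.book = None        # fields of the book being read, if any
        self.text = []          # text chunks of the current element

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []
        if name == "book":
            self.book = {}

    def characters(self, content):
        # May be called several times for one text node, so collect chunks.
        self.text.append(content)

    def endElement(self, name):
        self.stack.pop()
        if name == "book":
            self.on_book(self.book)
            self.book = None
        elif self.book is not None and self.stack and self.stack[-1] == "book":
            # A direct child of <book>: store its text under its tag name.
            self.book[name] = "".join(self.text).strip()
        self.text = []


# Assumed child elements; the real file may look different.
SAMPLE = (
    b"<books>"
    b"<book><title>A</title><author>X</author></book>"
    b"<book><title>B</title><author>Y</author></book>"
    b"</books>"
)

books = []
xml.sax.parseString(SAMPLE, BookHandler(books.append))
print(books)
```

Here the books are collected in a list only to keep the demo short. On a very large file you would more likely process each book inside the callback and then let it go, so that only one book is held in memory at a time.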
I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take considerable time and memory to parse your XML document. That could slow down your development if you have to wait for it to read in an 84MB XML document every time you run your code.
Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
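If you want to check this yourself, a small sketch along these lines should do. It assumes the file is UTF-8 and that its XML declaration says so; the TITLES values and the TextCollector class are invented for the test.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Made-up sample titles in Greek, Spanish and Arabic.
TITLES = ["Οδύσσεια", "El año de la niña", "ألف ليلة وليلة"]


class TextCollector(ContentHandler):
    """Hypothetical handler: collects the text of each <book>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._chunks = None  # None while outside a <book>

    def startElement(self, name, attrs):
        if name == "book":
            self._chunks = []

    def characters(self, content):
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._chunks))
            self._chunks = None


# Build a small UTF-8 encoded document from the sample titles.
document = (
    '<?xml version="1.0" encoding="UTF-8"?><books>'
    + "".join("<book>%s</book>" % t for t in TITLES)
    + "</books>"
).encode("utf-8")

collector = TextCollector()
xml.sax.parseString(document, collector)
print(collector.titles == TITLES)
```

The parser hands text to characters as ordinary Python strings, so no special handling should be needed as long as the declared encoding matches the actual bytes in the file.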