Trouble parsing a very large XML file with Python

Question

I have a large XML file (about 84MB) in this form:

<books>
    <book>...</book>
    ....
    <book>...</book>
</books>

My goal is to extract every single book and get its properties. I tried to parse it (as I did with other XML files) as follows:

from xml.dom.minidom import parse, parseString

fd = "myfile.xml"
parser = parse(fd)
## other python code here

but the code seems to fail at the parse call. Why is this happening, and how can I solve it?

I should point out that the file may contain Greek, Spanish and Arabic characters.

This is the output I got in IPython:

In [2]: fd = "myfile.xml"

In [3]: parser = parse(fd)
Killed

I would also like to point out that the computer freezes during execution, so this may be related to memory consumption, as stated below.

Recommended answer

I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are caused by minidom requesting too much memory.

Python comes with an XML SAX parser. To use it, do something like the following.

from xml.sax import parse
from xml.sax.handler import ContentHandler  # the module is xml.sax.handler (singular)

class MyContentHandler(ContentHandler):
    # override various ContentHandler methods as needed...
    pass


handler = MyContentHandler()
parse("mydata.xml", handler)

Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS or characters). These handle the events generated by the SAX parser as it reads your XML document.
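To make the event model concrete, here is a minimal sketch of a handler that just records the events it receives. The class name EventLogger, its events list and the inline sample document are illustrative assumptions, not part of the original answer; only ContentHandler, parseString and the overridden method names come from the standard library.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Record each SAX event as a tuple, in the order the parser emits it."""

    def __init__(self):
        super().__init__()
        self.events = []  # hypothetical attribute, used only for illustration

    def startElement(self, name, attrs):
        # attrs is a read-only mapping of attribute names to values
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip whitespace-only ones
        if content.strip():
            self.events.append(("text", content))


logger = EventLogger()
parseString(b'<books><book id="1">Dune</book></books>', logger)
for event in logger.events:
    print(event)
```

Each opening tag, closing tag and run of text arrives as a separate callback, and nothing is retained unless the handler chooses to store it.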

SAX is a more "low-level" way to handle XML than DOM; in addition to pulling the relevant data out of the document, your ContentHandler will need to keep track of which elements it is currently inside. On the upside, because SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including ones larger than yours.
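The bookkeeping described above could look roughly like the sketch below. It assumes that each book's properties are simple child elements such as <title> and <author>; the question does not show the contents of <book>, so those element names, the BookHandler class and the on_book callback are all hypothetical.

```python
import io
from xml.sax import parse
from xml.sax.handler import ContentHandler


class BookHandler(ContentHandler):
    """Build one dict per <book> and hand it to a callback as soon as it closes."""

    def __init__(self, on_book):
        super().__init__()
        self.on_book = on_book  # called once per completed book
        self.path = []          # stack of currently open element names
        self.book = None        # dict for the <book> being read, else None
        self.text = []          # text chunks of the innermost open element

    def startElement(self, name, attrs):
        self.path.append(name)
        if name == "book":
            self.book = {}
        self.text = []

    def characters(self, content):
        # Text can arrive in several pieces, so collect and join later
        self.text.append(content)

    def endElement(self, name):
        self.path.pop()
        if name == "book":
            self.on_book(self.book)
            self.book = None
        elif self.book is not None and self.path and self.path[-1] == "book":
            # A direct child of <book>: store its text under the tag name
            self.book[name] = "".join(self.text).strip()
        self.text = []


# Small in-memory stand-in for the real file (assumed structure)
sample = io.BytesIO(
    "<books>"
    "<book><title>Dune</title><author>Frank Herbert</author></book>"
    "<book><title>Οδύσσεια</title><author>Όμηρος</author></book>"
    "</books>".encode("utf-8")
)
books = []
parse(sample, BookHandler(books.append))
```

For the real file you would pass the filename, for example "myfile.xml", to parse instead of the in-memory sample. Collecting every book in a list is only for demonstration; to keep memory use flat, the callback could process or write out each book and then discard it.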

I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take considerable time and memory to parse your XML document. That could slow down your development if you have to wait for it to read in an 84MB XML document every time you run your code.

Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
