Problem description
I have 200,000 XML files I want to parse and store in a database.
Here is an example: https://gist.github.com/902292
This is about as complex as the XML files get. This will also run on a small VPS (Linode) so memory is tight.
What I want to know is:
1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML is small.
2) Where is a simple tutorial on said parser? (DOM or SAX)
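For comparison with DOM, here is a minimal SAX sketch using only the JDK's built-in parser. The element name "title" and the class name SaxTitleExtractor are made up for illustration; the real element names would come from the example gist. It shows the basic pattern: a handler receives start/characters/end events and keeps only the text it cares about, so no tree is built in memory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTitleExtractor {

    // Collects the text of every <title> element ("title" is a hypothetical name).
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName) && current != null) {
                titles.add(current.toString().trim());
                current = null;
            }
        }
    }

    // Parses one XML document held in memory and returns the collected titles.
    static List<String> extractTitles(byte[] xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new ByteArrayInputStream(xml), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>First</title><body>x</body><title>Second</title></doc>";
        System.out.println(extractTitles(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For files of a few kilobytes the memory difference per file is small either way; the handler style mostly pays off when the documents are large or when only a few fields are needed.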
Thanks
Edit
I tried the DOM route even though everyone suggested SAX. Mainly because I found an "easier" tutorial for DOM and I thought that since the average file size was about 3k - 4k it would easily be able to hold that in memory.
However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.
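One possible way to keep memory flat with DOM is sketched below. It assumes (without having confirmed it from the linked code) that the problem comes from holding on to parsed documents or building up deep recursion, not from any single file. It walks the directory lazily, parses one file at a time, pulls out what it needs, and lets the Document go out of scope before the next file. The names DomBatchWalker, parseAll, and handle are hypothetical; handle stands in for whatever builds the database record.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomBatchWalker {

    // Parses every .xml file under root, one at a time, and returns how many were parsed.
    static int parseAll(Path root) throws Exception {
        // One builder is reused for all files instead of creating 200,000 of them.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        int count = 0;
        // Files.walk is lazy, so the full list of paths is never held in memory.
        try (Stream<Path> paths = Files.walk(root)) {
            Iterator<Path> it = paths.filter(p -> p.toString().endsWith(".xml")).iterator();
            while (it.hasNext()) {
                Path file = it.next();
                Document doc = builder.parse(file.toFile());
                // Extract only the values needed; do not store the Document itself.
                handle(file, doc.getDocumentElement().getTagName());
                builder.reset();
                count++;
                // doc is unreachable after this iteration and can be garbage collected.
            }
        }
        return count;
    }

    // Hypothetical placeholder for building and storing a database record.
    static void handle(Path file, String rootElementName) {
        System.out.println(file.getFileName() + " -> <" + rootElementName + ">");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAll(Path.of(args.length > 0 ? args[0] : ".")) + " files parsed");
    }
}
```

If memory still grows with a loop like this, the leak is more likely in what gets accumulated per file (lists of records, pending inserts) than in DOM itself.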
Here is part of the project: https://gist.github.com/905550#file_xm_lparser.java
Should I ditch DOM now and just use SAX? Just seems like with such small files DOM should be able to handle it.
Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).
Thanks
Recommended answer
SAX generally beats DOM on speed. But since you say the XML files are small, you can continue with the DOM parser. One thing you can do to speed things up is create a thread pool and run the database operations in it. Multithreaded updates should improve performance significantly.
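A rough sketch of the thread-pool idea, using only java.util.concurrent. The store callback is a stand-in for the real database insert (the question mentions Mongo, but no driver calls are shown here), and PooledWriter, writeAll, and the queue size of 100 are illustrative choices. The bounded queue with CallerRunsPolicy is an addition of this sketch, not part of the answer: on a memory-tight VPS it keeps pending writes from piling up if parsing outpaces the database.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class PooledWriter {

    // Submits each record to a fixed-size pool; `store` stands in for the real database insert.
    static void writeAll(List<String> records, Consumer<String> store, int threads)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),               // at most 100 writes waiting
                new ThreadPoolExecutor.CallerRunsPolicy());  // when full, the submitter does the write
        for (String record : records) {
            pool.execute(() -> store.accept(record));
        }
        pool.shutdown(); // accept no new tasks, finish the queued ones
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("writes did not finish in time");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger stored = new AtomicInteger();
        writeAll(List.of("a", "b", "c"), r -> stored.incrementAndGet(), 4);
        System.out.println(stored.get()); // prints 3
    }
}
```

Whether this helps depends on where the time goes: the 19 seconds per 2000 files quoted in the question is measured before the insert, so it is worth timing the database step separately before adding threads.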
- 拉利斯