How to Parse Big (50 GB) XML Files in Java

Problem description

Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it freezes up completely. I have tried allocating more memory, etc., but I'm not seeing any improvement.

Is there any way to speed this up? Is there a better method?

I stripped it down to the bare bones, so I now have the following code, and when run from the command line it still doesn't go as fast as I would like.

Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error around article 700000.

Main:

import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Declare the variable that receives the parsed pages
        ArrayList<Page> pages = XMLManager.getPages();
    }
}

XMLManager:

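import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {

        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            // The backslash in a Windows path must be escaped in a Java string literal
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();

            // The handler collects every page into one in-memory list
            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return pages;
    }
}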

PageHandler:

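import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {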

    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(){
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

        stringBuilder = new StringBuilder();

         if (qName.equals("page")){

            page = new Page();
            idSet = false;

        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {

         if (page != null && !page.isRedirecting()){

             if (qName.equals("title")){

                 page.setTitle(stringBuilder.toString());

             } else if (qName.equals("id")){

                 if (!idSet){

                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;

                 }

             } else if (qName.equals("text")){

                 String articleText = stringBuilder.toString();

                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                 articleText = articleText.replaceAll("\n", " "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space

                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 matcher.find();

                 try {
                     page.setSummaryText(matcher.group());
                 } catch (IllegalStateException se){
                     page.setSummaryText("None");
                 }
                 page.setText(articleText);

             } else if (qName.equals("page")){

                 pages.add(page);
                 page = null;

            }
        } else {
            page = null;
        }
     }

     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         stringBuilder.append(ch,start, length); 
     }

     public ArrayList<Page> getPages() {
         return pages;
     }
}

Recommended answer

Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline that passes the data on to its actual destination without ever storing it all in memory at once.

What I've sometimes done in this sort of situation is similar to the following.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Supply an implementation of this interface to the PageHandler through its constructor:

public class Read  {
    public static void main(String[] args) {

        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing, 
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }

}


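import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {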

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file, pageHandler);

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Send each page to this processor instead of putting it in the list:

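import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {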

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
            //  Elide code not needing change

            } else if (qName.equals("page")){

                processor.process(page);
                page = null;

            }
        } else {
            page = null;
        }
    }

}

Of course, you can make your interface handle chunks of multiple records rather than just one: have the PageHandler collect pages locally in a smaller list, periodically send that list off for processing, and then clear it.
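A minimal sketch of that chunked variant, assuming nothing beyond the standard library. The names PageBatchProcessor and BatchCollector are hypothetical and do not come from the original post, and the types are generic only so that the sketch compiles without the Page class; in the handler you would use BatchCollector<Page>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch-oriented counterpart of PageProcessor (name is an assumption).
interface PageBatchProcessor<T> {
    void processBatch(List<T> batch);
}

// Collects items into a small list and hands the list off whenever it fills up.
class BatchCollector<T> {
    private final int batchSize;
    private final PageBatchProcessor<T> processor;
    private List<T> buffer;

    BatchCollector(int batchSize, PageBatchProcessor<T> processor) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.processor = processor;
        this.buffer = new ArrayList<>(batchSize);
    }

    // Call this where the handler would otherwise add the page to its big list.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; call once more at the end for the last partial batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> full = buffer;
        // Start a fresh list instead of clearing, so the processor may keep the old one.
        buffer = new ArrayList<>(batchSize);
        processor.processBatch(full);
    }
}
```

In the PageHandler, collector.add(page) would replace pages.add(page) in endElement, and overriding endDocument() to call collector.flush() would deliver the final partial batch.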

Or (perhaps better) you could keep the PageProcessor interface as defined here and write an implementation of it that buffers the data and sends it on in chunks for further handling.
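One way this could look is sketched below. BufferingPageProcessor and the Consumer-based downstream are assumptions made for illustration, not part of the original answer. Page is reduced to a stand-in because the real class is not shown in the post, and PageProcessor is repeated only so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Stand-in for the question's Page class, which the post does not show.
class Page {
    private String title;
    void setTitle(String title) { this.title = title; }
    String getTitle() { return title; } // assumed getter, not shown in the post
}

// Same shape as the PageProcessor interface defined in the answer.
interface PageProcessor {
    void process(Page page);
}

// Hypothetical implementation that buffers pages and forwards them in chunks.
class BufferingPageProcessor implements PageProcessor, AutoCloseable {
    private final int chunkSize;
    private final Consumer<List<Page>> downstream;
    private final List<Page> buffer;

    BufferingPageProcessor(int chunkSize, Consumer<List<Page>> downstream) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        this.chunkSize = chunkSize;
        this.downstream = downstream;
        this.buffer = new ArrayList<>(chunkSize);
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    // Forwards a copy of the buffered pages, then empties the buffer.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    // Delivers the last partial chunk once parsing has finished.
    @Override
    public void close() {
        flush();
    }
}
```

Because the class is AutoCloseable, wrapping the XMLManager.load(processor) call in a try-with-resources block flushes the final partial chunk automatically. The PageHandler stays exactly as in the answer, and memory use is bounded by chunkSize pages as long as the downstream does not retain them.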
