
How to parse invalid (bad / not well-formed) XML?

Question

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I ran some tests against actual customer data, and it looks like the other product allows input from users that should be considered invalid. Regardless, I still have to find a way to parse it. We're using javax.xml.parsers.DocumentBuilder, and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description contains what appears to be an invalid tag (<THIS-IS-PART-OF-DESCRIPTION>). The description tag is known to be a leaf tag and shouldn't have any nested tags inside it. Regardless, this is still an issue, and it yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Recommended answer

That "XML" is worse than invalid: it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
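To see what "not well-formed" means in practice, here is a minimal sketch that feeds the question's sample to a conformant parser. The class name WellFormednessCheck and the method isWellFormed are hypothetical helpers invented for illustration; only DocumentBuilder comes from the question. The second input shows the same description with the angle brackets escaped, which is what the provider would need to send.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the question's code base.
final class WellFormednessCheck {

    /** Returns true only if a conformant parser accepts the text as XML. */
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and recoverable errors and
            // rethrows fatal ones; it replaces the parser's default handler.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // includes SAXParseException for not-well-formed input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The question's sample: the pseudo-tag opens an element that is never closed.
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        // The same text with the angle brackets escaped as character data.
        String good = "<xml><description>Example:Description:"
                + "&lt;THIS-IS-PART-OF-DESCRIPTION&gt;</description></xml>";
        System.out.println("unescaped: " + isWellFormed(bad));
        System.out.println("escaped:   " + isWellFormed(good));
    }
}
```

The parser cannot know that the inner "tag" was meant as text; it sees an element that is still open when </description> arrives, so it must reject the document.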

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it may be useful for emphasis.)

  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

  • Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

  • Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

  • Python: Beautiful Soup is Python-based. See the notes in the "Differences between parsers" section of its documentation. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.

  • Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

  • .NET:

    • XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
    • @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML well-formed parsed entities lacking a root element.
    • @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
    • Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

  • PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.

  • Ruby: Nokogiri supports "Gentle Well-Formedness".

  • R: See htmlTreeParse() for fault-tolerant markup parsing in R.

  • Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
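As an illustration of the FilterInputStream idea mentioned for Java, the following is a minimal sketch of a stream wrapper that drops control bytes that are illegal in XML 1.0 before the parser sees them. The class name ControlCharStrippingInputStream and its strip helper are hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8, where bytes below 0x20 never occur inside multi-byte sequences. It only handles illegal control characters; it does nothing for stray tags or ampersands.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical wrapper; assumes an ASCII-compatible encoding such as UTF-8.
final class ControlCharStrippingInputStream extends FilterInputStream {

    ControlCharStrippingInputStream(InputStream in) {
        super(in);
    }

    /** True for C0 control bytes other than tab, line feed, and carriage return. */
    private static boolean isIllegal(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (isIllegal(b));
        return b; // -1 at end of stream
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;
            }
            // Compact the chunk in place, keeping only legal bytes.
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (!isIllegal(buf[i])) {
                    buf[kept++] = buf[i];
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was illegal bytes; read the next chunk.
        }
    }

    /** Convenience helper for in-memory data (requires Java 9+ for readAllBytes). */
    static byte[] strip(byte[] raw) {
        try (InputStream in = new ControlCharStrippingInputStream(new ByteArrayInputStream(raw))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In use, the wrapper would sit between the raw source and the parser, for example by passing new ControlCharStrippingInputStream(rawStream) to DocumentBuilder.parse(InputStream).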

  3. Process the data as text, manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.

  • For invalid character errors, use regex to remove/replace invalid characters:

    • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
    • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
    • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')

  • For ampersands, use regex to replace matches with &amp; (credit: blhsin):

    &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
    

  • Note that the above regular expressions won't take comments or CDATA sections into account.
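Since the question uses Java, here is a minimal sketch of the two regex cleanups in that language. The class XmlTextCleanup and its method names are hypothetical. The character class follows the XML 1.0 Char production and, unlike the PHP and Ruby patterns, also admits the supplementary range U+10000 to U+10FFFF; the ampersand pattern is the one credited to blhsin, extended here to accept uppercase hex digits. The same caveat applies: comments and CDATA sections are not treated specially.

```java
import java.util.regex.Pattern;

// Hypothetical helper class sketching text-level cleanup before XML parsing.
final class XmlTextCleanup {

    // Any code point outside the XML 1.0 Char production.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // An ampersand that does not start a numeric, hex, or named reference.
    private static final Pattern STRAY_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    /** Removes characters that may not appear in an XML 1.0 document. */
    static String stripInvalidChars(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }

    /** Escapes ampersands that are not already part of a reference. */
    static String escapeStrayAmpersands(String s) {
        return STRAY_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String raw = "<name>AT&T" + (char) 0x1B + " &amp; friends</name>";
        System.out.println(escapeStrayAmpersands(stripInvalidChars(raw)));
    }
}
```

Note that the ampersand pattern treats anything shaped like &name; as a reference, so text such as "a &b; c" is left untouched even when no entity named b is declared; that is one of the ways a seemingly predictable fix can still miss cases.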
