
How to parse invalid (bad / not well-formed) XML?

Question

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I ran some tests against actual customer data, and it looks like the other product allows user input that should be considered invalid. Regardless, I still have to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder, and I get an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description contains what appears to be an invalid tag (<THIS-IS-PART-OF-DESCRIPTION>). The description tag is known to be a leaf tag and shouldn't have any nested tags inside it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Recommended answer

That "XML" is worse than invalid: it's not well-formed. (Well-formed means the text obeys XML's syntax rules, such as properly nested and closed tags; valid means it additionally conforms to a schema or DTD.)

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
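To make the failure concrete, here is a minimal sketch in Python (used only for illustration; the question itself uses Java's DocumentBuilder). It feeds a trimmed version of the sample input to the standard library's conformant parser, xml.etree.ElementTree, which rejects it. The sample string is an assumption based on the snippet in the question with the "..." lines dropped, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample from the question
# (assumption: the "..." placeholder lines are dropped).
sample = (
    "<xml>"
    "<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>"
    "</xml>"
)


def is_well_formed(text):
    """Return True if a conformant XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for any well-formedness error, e.g. a mismatched tag.
        return False


print(is_well_formed(sample))
print(is_well_formed("<xml><description>ok</description></xml>"))
```

A check like this can also serve as a gate in front of the real parser, so that only rejected documents go through a repair step.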

Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it may be useful for emphasis.)

Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

Python: Beautiful Soup is Python-based; see the notes in the "Differences between parsers" section of its documentation. For other ways of dealing with not-well-formed markup in Python, note especially lxml's recover=True option. codecs.EncodedFile() can be used to clean up illegal characters.

Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

.NET:

XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML well-formed parsed entities lacking a root element.
@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntax issues, but note the rule-breaking warning under "Process the data as text" below.
Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

PHP: See DOMDocument::$recover and libxml_use_internal_errors(true).

Ruby: Nokogiri supports "Gentle Well-Formedness".

R: See htmlTreeParse() for fault-tolerant markup parsing in R.

Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."

Process the data as text, either manually in a text editor or programmatically with character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not: rule breaking is rarely bound by rules.
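For the specific input in the question, a text-level repair might look like the sketch below (Python, standard library only). It rests on assumptions that are not guaranteed by anything in the question: that description really is always a leaf element, that its content is raw text that has not already been escaped, and that the content never contains the literal string </description>. escape_description is a hypothetical helper, not part of any library.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: everything between <description> and the next </description>
# is plain text that was never escaped by the producing system.
_DESCRIPTION = re.compile(r"(<description>)(.*?)(</description>)", re.DOTALL)


def escape_description(text):
    """Escape &, < and > inside every <description> element (hypothetical helper)."""
    return _DESCRIPTION.sub(
        lambda m: m.group(1) + escape(m.group(2)) + m.group(3), text
    )


bad = (
    "<xml><description>Example:Description:"
    "<THIS-IS-PART-OF-DESCRIPTION></description></xml>"
)
fixed = escape_description(bad)

# The repaired text is now accepted by a conformant parser.
root = ET.fromstring(fixed)
print(root.findtext("description"))
```

If any of the assumptions fails (for example, a description that already contains &amp;), this would double-escape or mis-split the data, which is exactly the unpredictability warned about above.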

For invalid character errors, use regex to remove/replace invalid characters:

PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
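A Python counterpart could look like the following sketch. It follows the Char production of XML 1.0, so unlike the patterns above it also keeps supplementary-plane characters (U+10000 to U+10FFFF); that difference is my choice, not something from the original answer. strip_invalid_xml_chars is a hypothetical helper name.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace each run of characters that XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# NUL and vertical tab are not allowed in XML 1.0; tab is.
print(repr(strip_invalid_xml_chars("a\x00b\x0bc\td")))
```

Because the pattern ends in +, a run of consecutive invalid characters collapses into a single replacement, matching the behavior of the PHP pattern.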

For ampersands, use regex to replace matches with &amp; (credit: blhsin):

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Note that the above regular expressions won't take comments or CDATA sections into account.
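Applied in Python, the ampersand pattern might be used as in this sketch. The hex-digit class is widened to accept uppercase digits, which is my adjustment rather than part of the pattern above, and escape_bare_ampersands is a hypothetical helper name.

```python
import re

# An ampersand that does not start something shaped like a character or
# entity reference (&#38; / &#x26; / &name;).
_BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)")


def escape_bare_ampersands(text):
    """Replace stray ampersands with &amp; (hypothetical helper)."""
    return _BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("AT&T &amp; R&D &#38; &#x26;"))
```

The pattern leaves alone anything that merely looks like an entity reference, so a name that XML does not predefine (for example &copy; with no DTD declaring it) will still be rejected by a conformant parser.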

