Recovery mode
For trusted-but-buggy input (third-party RSS / Atom feeds, legacy data migration, diagnostic UIs), opt into recovery mode:
    use sup_xml::{parse_str_with_recovered, ParseOptions};

    let xml = "<r>tom & jerry<unclosed>"; // bare & + missing end tag
    let opts = ParseOptions { recovery_mode: true, ..Default::default() };

    // Returns the parse result plus the list of errors that were recovered from.
    let (doc, recovered) = parse_str_with_recovered(xml, &opts);
    let doc = doc.unwrap();
    for err in &recovered {
        eprintln!("recovered: {}", err.message);
    }

The default, recovery_mode: false, fails fast on the first
non-trivial error, which is the right behaviour for trusted internal
data and untrusted user input.
What gets preserved that other parsers drop
libxml2’s XML_PARSE_RECOVER silently corrupts text in three common
malformed-input scenarios; SupXML’s recovery mode preserves the
text in all three:
| Input | libxml2 recovered text | SupXML recovered text |
|---|---|---|
| <p>tom & jerry</p> | "tom jerry" (space-collapsed) | "tom & jerry" (literal) |
| <p>x ]]> y</p> | "]>y" (prefix dropped) | "x ]]> y" (preserved) |
| <p> followed by EOF | trailing text lost | trailing text preserved, </p> synthesised |
This is the design choice that motivates having our own recovery
mode at all — XML_PARSE_RECOVER is “be permissive and lossy”;
ours is “be permissive and don’t lose user data.” Inspect libxml2’s
behaviour against the same inputs with:
    cargo bench -p sup-xml-bench --bench libxml2_recovery_inspector

What’s test-backed today
crates/api/tests/recovery.rs covers:
| Scenario | Behaviour | Test |
|---|---|---|
| Bare & in PCDATA | preserved as literal & in text node | recover_logs_bare_ampersand_and_keeps_it_literal |
| Bare & round-trip | serializes back as &amp; | recover_bare_ampersand_serializes_back_as_amp_entity |
| Unclosed element at EOF | synthetic close inserted, one error per level | recover_logs_one_error_per_unclosed_level |
| Unclosed → serialize | output is well-formed | recover_unclosed_serializes_to_closed_xml |
| Combined bare-& + unclosed | both recoveries applied in source order | recover_combined_bare_amp_and_unclosed |
| Bytes input variant | same behaviour from parse_bytes_with_recovered | parse_bytes_with_recovered_handles_bare_amp |
| Invalid UTF-8 | fatal — recovery cannot help | parse_bytes_with_recovered_rejects_invalid_utf8 |
| Line/column tracking | preserved in recovered errors | recovered_error_carries_line_info |
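
The two bare-ampersand rows can be illustrated with a small standalone sketch. It is not SupXML's implementation: the `Recovered` struct and the `starts_reference` and `escape_bare_ampersands` functions are hypothetical names invented here, and the sketch works on raw character data rather than on a parsed tree. It assumes an `&` counts as bare when it does not begin a syntactically valid entity or character reference (ASCII names only, and without checking that a named entity is actually declared), logs it with a line and column, and writes it back out as `&amp;`.

```rust
/// A recovered (non-fatal) problem with a 1-based position.
/// Hypothetical type for this sketch; not SupXML's error type.
#[derive(Debug)]
struct Recovered {
    line: usize,
    column: usize,
    message: String,
}

/// Returns true if `rest` (the text right after an `&`) begins a
/// syntactically valid reference such as `amp;`, `#38;` or `#x26;`.
/// Simplification: names are restricted to ASCII.
fn starts_reference(rest: &str) -> bool {
    let end = match rest.find(';') {
        Some(end) => end,
        None => return false,
    };
    let name = &rest[..end];
    if let Some(hex) = name.strip_prefix("#x") {
        !hex.is_empty() && hex.chars().all(|c| c.is_ascii_hexdigit())
    } else if let Some(dec) = name.strip_prefix('#') {
        !dec.is_empty() && dec.chars().all(|c| c.is_ascii_digit())
    } else {
        let mut chars = name.chars();
        let first_ok =
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':');
        first_ok && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | ':' | '-' | '.'))
    }
}

/// Copies `raw` to the output, replacing each bare `&` with `&amp;`
/// and recording one `Recovered` entry per replacement.
fn escape_bare_ampersands(raw: &str) -> (String, Vec<Recovered>) {
    let mut out = String::with_capacity(raw.len());
    let mut recovered = Vec::new();
    let (mut line, mut column) = (1usize, 1usize);
    for (i, ch) in raw.char_indices() {
        // `&` is one byte in UTF-8, so `i + 1` is always a char boundary.
        if ch == '&' && !starts_reference(&raw[i + 1..]) {
            recovered.push(Recovered {
                line,
                column,
                message: "bare '&' kept as literal text".to_string(),
            });
            out.push_str("&amp;");
        } else {
            out.push(ch);
        }
        if ch == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    (out, recovered)
}

fn main() {
    let (out, recovered) = escape_bare_ampersands("tom & jerry &lt; spike");
    println!("{out}");
    for r in &recovered {
        println!("{}:{}: {}", r.line, r.column, r.message);
    }
}
```

The existing `&lt;` reference passes through untouched, while the bare `&` is escaped and reported once, which mirrors the "keep it literal, serialize it as an entity" behaviour the tests describe.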
From the shell
    # Recovery is a top-level CLI flag; works with every subcommand
    sup-xml --recover lint broken.xml                  # exit 0 if anything recovered
    sup-xml --recover repair broken.xml -o clean.xml   # write the cleaned tree
    sup-xml --recover format --pretty broken.xml       # pretty-print recovered

The sup-xml repair subcommand is the recovery-mode pipeline in
binary form — parse with recovery_mode: true, write the cleaned
document back. Useful for one-shot fixes to corrupted feed files.
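
For the unclosed-element case, the repair step can be pictured as a stack of open elements that is drained at end of input. The sketch below is an illustration under heavy simplifying assumptions, not the code behind sup-xml repair: `close_unclosed` is a hypothetical function, and its toy tokenizer only recognises start tags, end tags, and self-closing tags. It does not handle comments, CDATA sections, or processing instructions, and it ignores end tags that do not match the innermost open element.

```rust
/// Appends a synthetic end tag for every element still open at end of
/// input and returns one message per synthesized tag.
/// Hypothetical helper for illustration only.
fn close_unclosed(input: &str) -> (String, Vec<String>) {
    let mut open: Vec<&str> = Vec::new(); // names of currently open elements
    let mut rest = input;

    while let Some(lt) = rest.find('<') {
        let after = &rest[lt + 1..];
        let gt = match after.find('>') {
            Some(gt) => gt,
            None => break, // truncated tag: stop scanning
        };
        let tag = &after[..gt];
        if let Some(name) = tag.strip_prefix('/') {
            // End tag: pop only if it matches the innermost open element.
            if open.last() == Some(&name.trim()) {
                open.pop();
            }
        } else if !tag.ends_with('/') {
            // Start tag: remember the element name, dropping any attributes.
            open.push(tag.split_whitespace().next().unwrap_or(""));
        }
        rest = &after[gt + 1..];
    }

    let mut out = input.to_string();
    let mut errors = Vec::new();
    // Close from the innermost element outwards, one error per level.
    while let Some(name) = open.pop() {
        errors.push(format!("element <{name}> not closed before end of input"));
        out.push_str(&format!("</{name}>"));
    }
    (out, errors)
}

fn main() {
    let (out, errors) = close_unclosed("<feed><entry>late text");
    println!("{out}");
    for e in &errors {
        println!("recovered: {e}");
    }
}
```

Text that appears after the last start tag stays in place, because the synthetic end tags are only ever appended after it.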
When NOT to use recovery mode
Recovery mode is for inputs you’ve already decided you’d rather salvage than reject. Don’t reach for it as a default because:
- It quietly accepts ambiguous input. If a typo in your config file would make recovery produce a different tree than you intended, that typo should surface as an error, not be silently corrected.
- It hides upstream bugs. If the upstream is generating malformed XML, the right fix is to fix the upstream, not paper over it.
- Recovery isn’t deterministic across implementations. SupXML recovers more conservatively than libxml2, but a recovered tree isn’t guaranteed to round-trip to identical output across parsers.
For untrusted input (web request bodies, file uploads), prefer strict parsing + a fallback path — show the user a real error and let them re-submit, rather than silently transforming their input.
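
The guidance in this section can be condensed into a small decision function. This is a sketch of the policy only: `Provenance`, `OnMalformed`, and `policy` are hypothetical names that are not part of SupXML, and real code would use the result to decide whether to set recovery_mode before calling the parser.

```rust
/// Where a document came from. Hypothetical categories that mirror the
/// prose in this section.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance {
    /// Config files and data produced by your own code.
    TrustedInternal,
    /// Third-party feeds, legacy data migrations, diagnostic UIs.
    TrustedButBuggy,
    /// Web request bodies, file uploads.
    Untrusted,
}

/// What to do when the input turns out to be malformed.
#[derive(Debug, PartialEq)]
enum OnMalformed {
    /// Parse strictly and surface the first error to the caller or user.
    Reject,
    /// Parse with recovery and log every recovered error.
    Salvage,
}

/// Only input you have already decided to salvage gets recovery.
fn policy(source: Provenance) -> OnMalformed {
    match source {
        Provenance::TrustedButBuggy => OnMalformed::Salvage,
        Provenance::TrustedInternal | Provenance::Untrusted => OnMalformed::Reject,
    }
}

fn main() {
    for source in [
        Provenance::TrustedInternal,
        Provenance::TrustedButBuggy,
        Provenance::Untrusted,
    ] {
        println!("{:?} -> {:?}", source, policy(source));
    }
}
```

Keeping the decision in one place makes it harder for recovery to become the default by accident: a new input source has to be classified before it can be parsed leniently.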