Recovery mode

For trusted-but-buggy input (third-party RSS / Atom feeds, legacy data migration, diagnostic UIs), opt into recovery mode:

use sup_xml::{parse_str_with_recovered, ParseOptions};

let xml = "<r>tom & jerry<unclosed>"; // bare & + missing end tag
let opts = ParseOptions { recovery_mode: true, ..Default::default() };
let (doc, recovered) = parse_str_with_recovered(xml, &opts);
let doc = doc.unwrap();
for err in &recovered {
    eprintln!("recovered: {}", err.message);
}

The default — recovery_mode: false — fails fast on the first non-trivial error, which is the right behaviour for trusted internal data and untrusted user input.

What gets preserved that other parsers drop

libxml2’s XML_PARSE_RECOVER silently corrupts text in three common malformed-input scenarios; SupXML’s recovery mode preserves the text in all three:

| Input | libxml2 recovered text | SupXML recovered text |
|---|---|---|
| <p>tom & jerry</p> | "tom jerry" (space-collapsed) | "tom & jerry" (literal) |
| <p>x ]]> y</p> | "]>y" (prefix dropped) | "x ]]> y" (preserved) |
| <p> followed by EOF | trailing text lost | trailing text preserved, </p> synthesised |

This is the design choice that motivates having our own recovery mode at all — XML_PARSE_RECOVER is “be permissive and lossy”; ours is “be permissive and don’t lose user data.” Inspect libxml2’s behaviour against the same inputs with:

cargo bench -p sup-xml-bench --bench libxml2_recovery_inspector
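
To make the bare-`&` row of the comparison concrete, here is a minimal, self-contained sketch of the idea: keep the ampersand, log a recovered error, and emit `&amp;` when writing the text back out. It is not SupXML's implementation. The `Recovered` struct, `reference_len`, and `escape_bare_ampersands` are hypothetical names, the entity-name check is simplified to ASCII, and columns are assumed to be 1-based character counts.

```rust
/// A non-fatal error noted during recovery (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct Recovered {
    pub line: usize,
    pub column: usize,
    pub message: String,
}

/// Byte length of a well-formed reference (`&name;`, `&#38;`, `&#x26;`)
/// at the start of `s`, or `None` when the `&` is bare.
fn reference_len(s: &str) -> Option<usize> {
    let rest = s.strip_prefix('&')?;
    let end = rest.find(';')?;
    let body = &rest[..end];
    let well_formed = if let Some(hex) = body.strip_prefix("#x") {
        !hex.is_empty() && hex.chars().all(|c| c.is_ascii_hexdigit())
    } else if let Some(dec) = body.strip_prefix('#') {
        !dec.is_empty() && dec.chars().all(|c| c.is_ascii_digit())
    } else {
        // Simplified name rule: ASCII only, must not start with a digit.
        let mut chars = body.chars();
        matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
            && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.'))
    };
    // `&` + body + `;`
    well_formed.then_some(end + 2)
}

/// Copies `text`, passing valid references through unchanged and writing
/// each bare `&` as `&amp;`, with one `Recovered` entry per bare `&`.
pub fn escape_bare_ampersands(text: &str) -> (String, Vec<Recovered>) {
    let mut out = String::with_capacity(text.len());
    let mut recovered = Vec::new();
    let (mut line, mut column) = (1usize, 1usize);
    let mut rest = text;
    while let Some(c) = rest.chars().next() {
        if c == '&' {
            if let Some(len) = reference_len(rest) {
                out.push_str(&rest[..len]);
                column += len; // a validated reference is pure ASCII
                rest = &rest[len..];
                continue;
            }
            recovered.push(Recovered {
                line,
                column,
                message: "bare '&' kept as literal text".to_string(),
            });
            out.push_str("&amp;");
        } else {
            out.push(c);
        }
        if c == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
        rest = &rest[c.len_utf8()..];
    }
    (out, recovered)
}
```

Calling `escape_bare_ampersands("tom & jerry")` yields the text `tom &amp; jerry` plus one recovered entry at line 1, column 5. No characters are dropped, which is the property the table contrasts with the lossy behaviour.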

What’s test-backed today

crates/api/tests/recovery.rs covers:

| Scenario | Behaviour | Test |
|---|---|---|
| Bare & in PCDATA | preserved as literal & in text node | recover_logs_bare_ampersand_and_keeps_it_literal |
| Bare & round-trip | serializes back as &amp; | recover_bare_ampersand_serializes_back_as_amp_entity |
| Unclosed element at EOF | synthetic close inserted, one error per level | recover_logs_one_error_per_unclosed_level |
| Unclosed → serialize | output is well-formed | recover_unclosed_serializes_to_closed_xml |
| Combined bare-& + unclosed | both recoveries applied in source order | recover_combined_bare_amp_and_unclosed |
| Bytes input variant | same behaviour from parse_bytes_with_recovered | parse_bytes_with_recovered_handles_bare_amp |
| Invalid UTF-8 | fatal; recovery cannot help | parse_bytes_with_recovered_rejects_invalid_utf8 |
| Line/column tracking | preserved in recovered errors | recovered_error_carries_line_info |
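
The "unclosed element at EOF" rows can be pictured as a stack of open element names that is unwound at end of input. The sketch below is an illustration under simplifying assumptions, not the code under test: `close_unclosed` is a hypothetical helper, it treats anything between `<` and `>` as a tag, and it does not handle comments, CDATA sections, or `>` inside attribute values properly.

```rust
/// Appends a synthetic end tag for every element still open at end of
/// input. Returns the repaired text and one message per unclosed level,
/// innermost element first.
pub fn close_unclosed(input: &str) -> (String, Vec<String>) {
    let mut stack: Vec<&str> = Vec::new();
    let mut rest = input;
    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        // A tag cut off before its `>` is out of scope for this sketch.
        let Some(end) = after.find('>') else { break };
        let tag = &after[..end];
        rest = &after[end + 1..];
        if let Some(name) = tag.strip_prefix('/') {
            // End tag: pop only when it matches the innermost open element.
            if stack.last() == Some(&name.trim()) {
                stack.pop();
            }
        } else if !tag.ends_with('/') && !tag.starts_with(|c| c == '?' || c == '!') {
            // Start tag: the element name runs up to the first whitespace.
            if let Some(name) = tag.split_whitespace().next() {
                stack.push(name);
            }
        }
    }
    let mut output = String::from(input);
    let mut errors = Vec::new();
    // Unwind innermost-first so the synthetic end tags nest correctly.
    while let Some(name) = stack.pop() {
        errors.push(format!("element <{name}> not closed before end of input"));
        output.push_str("</");
        output.push_str(name);
        output.push('>');
    }
    (output, errors)
}
```

For the input `<r>tom<unclosed>` this produces `<r>tom<unclosed></unclosed></r>` and two messages, one per open level, mirroring the "one error per level" behaviour listed in the table.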

From the shell

# Recovery is a top-level CLI flag; works with every subcommand
sup-xml --recover lint broken.xml # exit 0 if anything recovered
sup-xml --recover repair broken.xml -o clean.xml # write the cleaned tree
sup-xml --recover format --pretty broken.xml # pretty-print recovered

The sup-xml repair subcommand is the recovery-mode pipeline in binary form — parse with recovery_mode: true, write the cleaned document back. Useful for one-shot fixes to corrupted feed files.

When NOT to use recovery mode

Recovery mode is for inputs you’ve already decided you’d rather salvage than reject. Don’t reach for it as a default because:

  • It quietly accepts ambiguous input. A typo in your config file that recovery would coerce into a different tree shouldn’t be silently corrected.
  • It hides upstream bugs. If the upstream is generating malformed XML, the right fix is to fix the upstream, not paper over it.
  • Recovery isn’t consistent across implementations. SupXML recovers more conservatively than libxml2, but a recovered tree isn’t guaranteed to round-trip to identical output across parsers.

For untrusted input (web request bodies, file uploads), prefer strict parsing + a fallback path — show the user a real error and let them re-submit, rather than silently transforming their input.
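
The strict-plus-fallback pattern for untrusted input might be wired up roughly as follows. This is a sketch under stated assumptions: `Outcome` and `handle_untrusted` are hypothetical names, the strict parser is passed in as a closure rather than tied to any particular SupXML call, and the 400 status is only an example of a "real error" returned to the submitter.

```rust
/// What gets sent back to whoever submitted the document.
#[derive(Debug, PartialEq)]
pub enum Outcome<T> {
    Accepted(T),
    Rejected { status: u16, message: String },
}

/// Runs a strict parse and, on failure, reports the error to the submitter
/// instead of retrying in recovery mode and silently altering their input.
pub fn handle_untrusted<T, E: std::fmt::Display>(
    body: &str,
    strict_parse: impl Fn(&str) -> Result<T, E>,
) -> Outcome<T> {
    match strict_parse(body) {
        Ok(doc) => Outcome::Accepted(doc),
        Err(err) => Outcome::Rejected {
            status: 400,
            message: format!("XML rejected: {err}. Fix the document and re-submit."),
        },
    }
}
```

The fallback here is a user-facing error path, not a second parse with recovery enabled, so the submitter sees exactly why the document was refused.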