HTML5 parsing
SupXML ships an HTML5 parser built on html5ever — Mozilla’s WHATWG-
conformant tokenizer + tree builder, the same code Servo uses. It
produces SupXML’s Document type, so the XPath / XSLT / Schematron /
serializer / serde-de paths work on HTML inputs without translation.
Enabling
[dependencies]sup-xml = { version = "*", features = ["html"] }Parse a document
use sup_xml::{parse_html_str, parse_html_bytes};
let doc = parse_html_str(r#"<!doctype html><html><body><p>hi</p></body></html>"#)?;
// From raw bytes (encoding auto-detected via WHATWG sniff)let doc = parse_html_bytes(include_bytes!("page.html"))?;parse_html_str is lenient by default — it’ll close unclosed tags,
infer <html> / <body> wrappers, and recover from malformed markup
the way a browser does. To insist on strict tokenisation, use
parse_html_str_opts with recovery_mode: false:
use sup_xml::{parse_html_str_opts, HtmlParseOptions};
let strict = HtmlParseOptions { recovery_mode: false, ..Default::default() };let doc = parse_html_str_opts( "<!doctype html><html><body><p>hi</p></body></html>", &strict)?;Strict mode is for confirming input is already a complete, well-formed
document: it requires a <!doctype> and rejects bare fragments
(<p>hi</p> on its own errors). Reach for the default lenient parser
when you’re cleaning up real-world or partial HTML.
Query and modify like XML
use sup_xml::XPathContext;
let doc = parse_html_str(html)?;let ctx = XPathContext::new(&doc);
// Same XPath surface as XML.let titles = ctx.eval_strings("//h2/text()")?;for t in titles { println!("{t}"); }XPath axes, predicates, the EXSLT function library, and namespace
handling all work identically. The DOM is the same Document, so
serialize_to_string round-trips back to XHTML-shaped output.
Streaming SAX-style parse
For documents that don’t fit in memory, HtmlSaxParser runs the
tokenizer + tree builder against an incremental byte feed and emits
events through your HtmlSaxHandler:
use sup_xml::{HtmlSaxParser, HtmlSaxHandler, HtmlAttrs};
struct Counter { p: usize }impl HtmlSaxHandler for Counter { // Every method has a default no-op impl — override only what you need. fn start_element(&mut self, name: &str, _attrs: HtmlAttrs<'_>) { if name == "p" { self.p += 1; } }}
let mut parser = HtmlSaxParser::new(Counter { p: 0 });parser.feed("<html><body><p>one</p><p>two</p></body></html>")?;let counter = parser.finish()?; // `finish` consumes the parser and returns the handlerassert_eq!(counter.p, 2);Maps onto libxml2’s htmlSAXParseChunk and lxml’s
HTMLParser(target=...). The handler API is push-based, so it’s safe
under streaming and async runtimes alike.
What’s covered, what isn’t
✅ Full WHATWG HTML5 tokenizer + tree builder via html5ever.
✅ Browser-equivalent recovery on malformed input.
✅ Encoding sniff (UTF-8, UTF-16, declared <meta charset>, plus the
WHATWG fall-backs via encoding_rs).
✅ Boolean attributes, void elements, raw <script> / <style>
content, attribute-value unquoting.
❌ Encoding re-detection mid-parse — WHATWG allows the parser to detect
a <meta> later in the document and restart with a different
encoding; we commit to the sniffed encoding at the first byte.
❌ Pretty-printed HTML output — printer is compact; HTML pretty-printing
needs block-vs-inline awareness that’s not in v1.
❌ parse_html_fragment (always wraps in implicit <html>/<body>) —
the fragment-context entry point is on the roadmap.
Performance
Median throughput ~1.02× faster than libxml2’s HTML parser and ~1.01× of html5ever at the matched-contract head-to-head on 9 real-world pages. See the performance reference for the per-fixture table.