The html package provides a robust HTML 4.01 parser that produces a Helium DOM tree utilizing the same core structures as the XML parser. It addresses the idiosyncrasies of HTML parsing (including tag omission, entity references, and implicit element closure) while maintaining compatibility with Helium’s SAX-based architecture and DOM interfaces. This allows developers to apply XPath and XSLT capabilities seamlessly to HTML content.
The HTML parser is part of the Processing Layer within Helium's modular architecture, designed for parsing and tree building. The parser:
- Converts HTML input (a byte slice or an io.Reader) into a helium.Document with HTML-specific typings.

This integration handles complexities such as auto-closing elements and entity substitution while still producing node trees that conform to Helium’s common Node interface hierarchy.
This flow highlights event emission during HTML parsing and the subsequent DOM tree construction, which shares infrastructure with the XML parser pipeline.
Sources: `.claude/docs/parser-internals.md:10-11`, `.claude/docs/parser-internals.md:21`
Helium models HTML documents distinctly to align with HTML semantics while reusing the core DOM abstractions.
| Type | Implementation | Description |
|---|---|---|
| HTMLDocumentNode | Specialized Document subtype | Root node type representing an HTML document tree. Flags and behavior reflect HTML 4.01 specifics. |
| Node | Generic node interface | HTML elements, attributes, text, and comments all implement helium.Node, enabling uniform traversal and manipulation. |
This typing enables queries, transformations, and serializations to behave correctly given the characteristics of HTML input.
Sources: `.claude/docs/libxml2-parity.md:103`
The HTML parser employs a parserCtx structure that acts as a finite-state machine (FSM), driving the stateful lexical and syntactic analysis of HTML input. Key points:

- Input is read through internal/strcursor.StrCursor for efficient navigation and context management.
- SAX events (StartElement, EndElement, Characters, etc.) are routed to a sax.SAX2Handler.
- Chunked input is supported through PushParser, enabling incremental parsing in streaming scenarios.

| Function or Type | Purpose |
|---|---|
| Parse(ctx, []byte) | Entry point for HTML document parsing; returns a DOM. |
| ParseReader(ctx, io.Reader) | Parses HTML data streamed from an io.Reader source. |
| NewPushParser(ctx) | Creates a PushParser for incremental chunk parsing. |
| parserCtx | Internal parsing context holding the current state and buffers. |
| PushParser | Wraps parserCtx to parse chunked input asynchronously. |
Sources: `.claude/docs/parser-internals.md:10-11`, `.claude/docs/libxml2-parity.md:107`
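- SAX events are turned into a DOM tree by helium.TreeBuilder.
- The parser recognizes HTML named entities (for example `&nbsp;` and `&copy;`) and resolves them during parsing.
- Incremental parsing works by feeding data through ParseChunk() and finalizing with Finish().

Sources: `.claude/docs/libxml2-parity.md:103`, `.claude/docs/parser-internals.md:69-70`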
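Entity resolution can be illustrated with a small lookup-table sketch. This is an assumption-laden example, not Helium's code: `namedEntities` and `resolveEntities` are hypothetical names, the table holds only five entries out of the full HTML 4.01 set, and numeric references such as `&#169;` are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// namedEntities is a tiny illustrative subset of the HTML 4.01 entity table.
var namedEntities = map[string]string{
	"amp":  "&",
	"lt":   "<",
	"gt":   ">",
	"nbsp": "\u00a0",
	"copy": "\u00a9",
}

// resolveEntities replaces known "&name;" references with their characters.
// Unknown or unterminated references are copied through unchanged.
func resolveEntities(s string) string {
	var out strings.Builder
	for i := 0; i < len(s); {
		if s[i] == '&' {
			// end is the offset of ';' relative to the '&'.
			if end := strings.IndexByte(s[i:], ';'); end > 1 {
				if rep, ok := namedEntities[s[i+1:i+end]]; ok {
					out.WriteString(rep)
					i += end + 1
					continue
				}
			}
		}
		out.WriteByte(s[i])
		i++
	}
	return out.String()
}

func main() {
	fmt.Println(resolveEntities("&copy; 2024 A &amp; B"))
	// prints: © 2024 A & B
}
```

Leaving unknown references intact is a deliberate choice in this sketch; a lenient HTML parser generally prefers passing text through over rejecting the document.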
Error reporting in the HTML parser closely follows libxml2 conventions to meet golden output compatibility:
Parsing errors are presented with precise location and context highlighting:
{filename}:{line}: HTML parser error : {message}
{source line snippet}
{spaces up to error column}^{caret}
Example:
./test/HTML/example.html:10: HTML parser error : Unexpected end tag : div
<p>Some invalid html</div>
^
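- Errors returned from SAX callbacks (for example StartElement) are handled through handleSAXErr (`.claude/docs/error-formatting.md:101`).
- Non-fatal errors are reported as warnings (Warning(err)), allowing parsing to continue (`.claude/docs/error-formatting.md:103-104`).
- In strict mode (Strict(true)), errors cause termination via parser.fatalSAXErr after the parser stabilizes (`.claude/docs/error-formatting.md:105`).
- The source context shown under each message follows libxml2's xmlParserInputGetWindow (`.claude/docs/error-formatting.md:91-92`).
- Messages carry the "HTML parser error" label to distinguish them from other parser domains.

Sources: `.claude/docs/error-formatting.md:82-106`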
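The three-line message layout described above can be reproduced with a few lines of Go. This is a sketch only: `formatParserError` is a hypothetical helper, the column passed in the example is chosen arbitrarily for illustration, and the context-window trimming that libxml2 applies to long source lines is not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// formatParserError renders a message in the layout
//   {filename}:{line}: HTML parser error : {message}
//   {source line}
//   {padding}^
// col is the 1-based column where the caret should appear.
func formatParserError(filename string, line, col int, msg, srcLine string) string {
	if col < 1 {
		col = 1
	}
	return fmt.Sprintf("%s:%d: HTML parser error : %s\n%s\n%s^",
		filename, line, msg, srcLine, strings.Repeat(" ", col-1))
}

func main() {
	// Column 21 is an assumption made for this example; it points at
	// the start of the stray end tag.
	fmt.Println(formatParserError("./test/HTML/example.html", 10, 21,
		"Unexpected end tag : div", "<p>Some invalid html</div>"))
}
```

Keeping the formatter a pure function of its inputs makes it straightforward to compare output byte for byte against golden files.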
The package supports parsing streamed or chunked HTML input via the PushParser. Internally, it creates its own parserCtx and runs parsing in a separate goroutine.
This model supports parsing large or network streaming HTML in an incremental, non-blocking manner.
Sources: `.claude/docs/parser-internals.md:10-11`, `.claude/docs/libxml2-parity.md:107`
These examples demonstrate different ways of ingesting HTML content for DOM construction.
Sources: `.claude/docs/parser-internals.md:5-6`, `.claude/docs/parser-internals.md:10-11`
The html package plays a critical role in the Helium ecosystem by parsing HTML 4.01 documents into a fully navigable and transformable Helium DOM. It maintains fidelity to HTML parsing rules while leveraging Helium’s SAX and DOM infrastructure for consistency with XML processing. Its push parser model and libxml2-compatible error reporting make it suitable for demanding, production-level HTML processing.