The Helium parser is a high-performance, pure-Go XML parser designed to closely match libxml2 semantics while embracing Go idioms like garbage collection and modern error handling. This page provides a deep technical dive into the core parsing internals, focusing on the central parser context (parserCtx), the detailed parse pipeline stages, optimizations such as UTF-8 fast paths, the entity amplification guard to prevent DoS, support for incremental push parsing, and strategies for error recovery.
At the heart of Helium's parsing lies the parserCtx struct, which encapsulates all mutable state for the duration of a parse invocation. This includes input source tracking, parser state machine status, namespace and element stacks, DTD and entity state, SAX callbacks, and error management.
- inputTab (inputStack): A last-in-first-out (LIFO) stack holding input cursors (ByteCursor or RuneCursor). This layered cursor design lets the parser handle entity expansion and external DTD loading transparently by pushing new cursors. (parserctx.go:100)
- getCursor(): Returns the active cursor on top of inputTab, auto-popping exhausted ones and caching the active cursor between parser calls. (parserctx.go:101)

The parser advances through distinct states (ParserState) that govern syntactic expectations. Noteworthy states include:
| State | Description |
|---|---|
| psStart / psEOF | Initial and terminal states. (.claude/docs/parser-internals.md:58) |
| psPrologue / psEpilogue | Parsing before and after the root element. (.claude/docs/parser-internals.md:58) |
| psContent | Recursive descent content parsing state. (.claude/docs/parser-internals.md:58) |
| psDTD / psEntityDecl | Inside DTD internal or external subset. (.claude/docs/parser-internals.md:58) |
| psAttributeValue | Attribute value parsing; external entity refs forbidden here. (.claude/docs/parser-internals.md:60) |
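
The layered input stack described earlier (inputTab with getCursor() auto-popping exhausted cursors) can be illustrated with a minimal sketch. The `cursor`, `inputStack`, and `expand` names are hypothetical and are not Helium's types. Caching of the active cursor is not modeled, unknown entities are simply dropped, and there is no recursion guard.

```go
package main

import "fmt"

// cursor is a hypothetical stand-in for a ByteCursor: bytes plus a read position.
type cursor struct {
	data []byte
	pos  int
}

func (c *cursor) exhausted() bool { return c.pos >= len(c.data) }

// inputStack is a LIFO stack of cursors; the top entry is the active input.
type inputStack struct {
	stack []*cursor
}

func (s *inputStack) push(text string) {
	s.stack = append(s.stack, &cursor{data: []byte(text)})
}

// current returns the active cursor, popping exhausted cursors first.
// It returns nil once every input has been consumed.
func (s *inputStack) current() *cursor {
	for len(s.stack) > 0 {
		top := s.stack[len(s.stack)-1]
		if !top.exhausted() {
			return top
		}
		s.stack = s.stack[:len(s.stack)-1]
	}
	return nil
}

// next reads one byte from whichever cursor is currently active.
func (s *inputStack) next() (byte, bool) {
	c := s.current()
	if c == nil {
		return 0, false
	}
	b := c.data[c.pos]
	c.pos++
	return b, true
}

// expand resolves &name; references by pushing the replacement text as a new
// cursor, so nested references are handled by the same read loop.
func expand(doc string, entities map[string]string) string {
	s := &inputStack{}
	s.push(doc)
	var out []byte
	for {
		b, ok := s.next()
		if !ok {
			return string(out)
		}
		if b != '&' {
			out = append(out, b)
			continue
		}
		var name []byte
		for {
			c, ok := s.next()
			if !ok || c == ';' {
				break
			}
			name = append(name, c)
		}
		if text, found := entities[string(name)]; found {
			s.push(text)
		}
	}
}

func main() {
	entities := map[string]string{"x": "b&y;", "y": "c"}
	fmt.Println(expand("a&x;d", entities)) // prints "abcd"
}
```

Because the read loop only ever asks the stack for the next byte, it does not need to know whether that byte came from the document or from an entity's replacement text.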
- nodeTab (nodeStack): Element nesting stack used for recursive parsing. (.claude/docs/parser-internals.md:63)
- nsTab (nsStack): Prefix-to-URI bindings managed via Push(prefix, uri) and Lookup(prefix). (.claude/docs/parser-internals.md:64)
- nsNrTab: Parallel array to nodeTab recording namespace declaration counts per element level, so that exactly that many bindings are popped when the element closes. (.claude/docs/parser-internals.md:65)
- spaceTab: xml:space stack (-1=inherit, 0=default, 1=preserve). (.claude/docs/parser-internals.md:66)
- attsSpecial: Map of special attributes (e.g., ID types) from the DTD. (parserctx.go:160)
- attsDefault: Default attributes introduced by DTD declarations. (parserctx.go:161)
- inSubset: Tracks whether the parser is in the internal (1) or external (2) DTD subset. (parserctx.go:162)
- replaceEntities: Flag to expand entity references inline, controlled by SubstituteEntities(true). (parserctx.go:163)
- fsys: fs.FS used for loading external DTDs/entities, defaulting to PermissiveRoot. (parserctx.go:164)

Sources: .claude/docs/parser-internals.md:49-89, parserctx.go:100-101, parserctx.go:160-164
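
The interplay between nsTab and nsNrTab can be sketched as follows. This is an illustrative model only: the `binding`, `nsStack`, and `scope` types, the `Pop` method, and the return type of `Lookup` are assumptions, not Helium's actual definitions.

```go
package main

import "fmt"

// binding is one prefix-to-URI declaration.
type binding struct{ prefix, uri string }

// nsStack is a hypothetical flat stack of namespace bindings.
type nsStack struct{ items []binding }

func (s *nsStack) Push(prefix, uri string) {
	s.items = append(s.items, binding{prefix, uri})
}

// Lookup searches from the top so inner declarations shadow outer ones.
func (s *nsStack) Lookup(prefix string) (string, bool) {
	for i := len(s.items) - 1; i >= 0; i-- {
		if s.items[i].prefix == prefix {
			return s.items[i].uri, true
		}
	}
	return "", false
}

// Pop removes the n most recent bindings.
func (s *nsStack) Pop(n int) { s.items = s.items[:len(s.items)-n] }

// scope pairs the binding stack with a per-element declaration count,
// mirroring the role the text assigns to nsNrTab.
type scope struct {
	ns   nsStack
	nsNr []int
}

func (sc *scope) startElement(decls ...binding) {
	for _, d := range decls {
		sc.ns.Push(d.prefix, d.uri)
	}
	sc.nsNr = append(sc.nsNr, len(decls))
}

// endElement pops exactly the bindings declared on the closing element.
func (sc *scope) endElement() {
	n := sc.nsNr[len(sc.nsNr)-1]
	sc.nsNr = sc.nsNr[:len(sc.nsNr)-1]
	sc.ns.Pop(n)
}

func main() {
	sc := &scope{}
	sc.startElement(binding{"a", "urn:outer"})
	sc.startElement(binding{"a", "urn:inner"}, binding{"b", "urn:b"})
	uri, _ := sc.ns.Lookup("a")
	fmt.Println(uri) // prints "urn:inner"
	sc.endElement()
	uri, _ = sc.ns.Lookup("a")
	fmt.Println(uri) // prints "urn:outer"
}
```

Recording a count per element keeps the binding stack flat: closing an element is a single slice truncation rather than a search for the bindings it introduced.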
Helium's parsing proceeds through a well-defined pipeline, converting input into a DOM tree while emitting SAX2 events.
Sources: .claude/docs/parser-internals.md:23-47
Helium performs multi-step encoding detection in detectEncoding():
1. UCS-4: Detects < byte patterns (BE/LE/2143/3412) by peeking. (.claude/docs/parser-internals.md:94)
2. EBCDIC: Detects the 0x4C 0x6F 0xA7 0x94 invariant prefix. (.claude/docs/parser-internals.md:95)
3. BOM: Detects UTF-8 (0xEF 0xBB 0xBF) and UTF-16 (0xFF 0xFE / 0xFE 0xFF). (.claude/docs/parser-internals.md:96-97)

Strict Decoding: Helium uses withStrictDecode in internal/encoding to ensure malformed sequences (like unpaired surrogates) trigger ErrInvalidEncodedChar rather than silent replacement with U+FFFD. (.claude/docs/parser-internals.md:108-112)
Sources: .claude/docs/parser-internals.md:91-116
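
The byte-signature checks and the strict-decoding policy described for encoding detection can be sketched in a few lines. This is not Helium's detectEncoding or withStrictDecode: `sniffEncoding`, `decodeUTF16Strict`, the returned encoding labels, and the UTF-8 fallback are assumptions made for illustration. The byte patterns themselves come from the text above.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

var errInvalidEncodedChar = errors.New("invalid encoded character")

// signature pairs a leading byte pattern with the encoding it implies.
type signature struct {
	prefix []byte
	name   string
}

// Order follows the text: UCS-4 '<' patterns, EBCDIC, then BOMs.
var signatures = []signature{
	{[]byte{0x00, 0x00, 0x00, 0x3C}, "UCS-4BE"},
	{[]byte{0x3C, 0x00, 0x00, 0x00}, "UCS-4LE"},
	{[]byte{0x00, 0x00, 0x3C, 0x00}, "UCS-4 (2143)"},
	{[]byte{0x00, 0x3C, 0x00, 0x00}, "UCS-4 (3412)"},
	{[]byte{0x4C, 0x6F, 0xA7, 0x94}, "EBCDIC"},
	{[]byte{0xEF, 0xBB, 0xBF}, "UTF-8"},
	{[]byte{0xFF, 0xFE}, "UTF-16LE"},
	{[]byte{0xFE, 0xFF}, "UTF-16BE"},
}

// sniffEncoding peeks at the first bytes of the input without consuming them.
func sniffEncoding(head []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(head, sig.prefix) {
			return sig.name
		}
	}
	return "UTF-8" // assumption: fall back to the XML default
}

// decodeUTF16Strict rejects unpaired surrogates instead of emitting U+FFFD.
func decodeUTF16Strict(units []uint16) ([]rune, error) {
	var out []rune
	for i := 0; i < len(units); i++ {
		r := rune(units[i])
		if !utf16.IsSurrogate(r) {
			out = append(out, r)
			continue
		}
		if i+1 < len(units) {
			// DecodeRune returns U+FFFD when the two units are not a valid pair.
			if d := utf16.DecodeRune(r, rune(units[i+1])); d != utf8.RuneError {
				out = append(out, d)
				i++
				continue
			}
		}
		return nil, errInvalidEncodedChar
	}
	return out, nil
}

func main() {
	fmt.Println(sniffEncoding([]byte{0xEF, 0xBB, 0xBF, '<'})) // prints "UTF-8"
	_, err := decodeUTF16Strict([]uint16{0xD800, 0x0041})
	fmt.Println(err) // prints "invalid encoded character"
}
```

Returning an error for a lone surrogate, rather than substituting U+FFFD, means corrupted input is reported to the caller instead of silently altering document content.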
To prevent DoS attacks like the "Billion Laughs," Helium enforces strict amplification limits during entity expansion.
- sizeentcopy: Tracks cumulative bytes from entity expansion. (.claude/docs/parser-internals.md:81)
- maxAmpl: Default factor of 5 (5x expansion relative to input). (.claude/docs/parser-internals.md:82)

When parsing nested entities (external or internal), the inheritNestedParserState helper in parser_entity_decl.go ensures that maxElemDepth and the current elemDepth are carried over. This prevents attackers from bypassing depth limits by splitting deep nesting across entity boundaries. (.claude/docs/parser-internals.md:78)
Sources: .claude/docs/parser-internals.md:78-85
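
A ratio-based amplification check of the kind described for sizeentcopy and maxAmpl might look like the sketch below. The `ampGuard` type, its `inputSize` field, and the exact comparison are assumptions; the source only states that expanded bytes are accumulated and that the default factor is 5. Production parsers often also skip the check below a fixed byte threshold so that tiny documents are not rejected, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

var errEntityAmplification = errors.New("entity expansion exceeds amplification limit")

// ampGuard is a hypothetical model of the amplification counters.
type ampGuard struct {
	inputSize   int64 // bytes read from the original input (assumed field)
	sizeentcopy int64 // cumulative bytes produced by entity expansion
	maxAmpl     int64 // allowed expansion factor relative to the input
}

// addExpansion records n expanded bytes and fails once the running total
// exceeds maxAmpl times the input size.
func (g *ampGuard) addExpansion(n int) error {
	g.sizeentcopy += int64(n)
	if g.sizeentcopy > g.maxAmpl*g.inputSize {
		return errEntityAmplification
	}
	return nil
}

func main() {
	g := &ampGuard{inputSize: 100, maxAmpl: 5}
	fmt.Println(g.addExpansion(400)) // prints "<nil>"
	fmt.Println(g.addExpansion(101)) // prints "entity expansion exceeds amplification limit"
}
```

Because the counter is cumulative, an attack built from many small expansions trips the same limit as one large expansion.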
Helium provides a background push parser via p.NewPushParser(ctx) (.claude/docs/parser-internals.md:7). It processes data incrementally, as it arrives, in a background goroutine. Unlike libxml2's non-blocking push API, Helium's implementation blocks on chunk boundaries to manage flow control (.claude/docs/libxml2-parity.md:107).
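
The blocking, goroutine-backed design can be illustrated with the standard library alone. The sketch below is not Helium's push parser API: `pushParser`, `newPushParser`, `Push`, and `Close` are hypothetical names, and encoding/xml stands in for the real parser. It relies on io.Pipe, whose Write blocks until the reader has consumed the data, to give the chunk-boundary flow control described above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
)

// result carries the outcome of the background parse.
type result struct {
	elements int
	err      error
}

// pushParser is a hypothetical wrapper that feeds chunks to a parser goroutine.
type pushParser struct {
	w    *io.PipeWriter
	done chan result
}

func newPushParser() *pushParser {
	r, w := io.Pipe()
	p := &pushParser{w: w, done: make(chan result, 1)}
	go func() {
		dec := xml.NewDecoder(r) // stand-in for the real parser
		var res result
		for {
			tok, err := dec.Token()
			if err == io.EOF {
				break
			}
			if err != nil {
				res.err = err
				break
			}
			if _, ok := tok.(xml.StartElement); ok {
				res.elements++
			}
		}
		// Unblock any pending or later Push once parsing has stopped.
		r.CloseWithError(res.err)
		p.done <- res
	}()
	return p
}

// Push blocks until the parser goroutine has consumed the whole chunk.
func (p *pushParser) Push(chunk []byte) error {
	_, err := p.w.Write(chunk)
	return err
}

// Close signals end of input and waits for the final result.
func (p *pushParser) Close() (int, error) {
	p.w.Close()
	res := <-p.done
	return res.elements, res.err
}

func main() {
	p := newPushParser()
	for _, chunk := range []string{"<root><a/>", "<b></b></ro", "ot>"} {
		if err := p.Push([]byte(chunk)); err != nil {
			fmt.Println("push failed:", err)
			break
		}
	}
	n, err := p.Close()
	fmt.Println(n, err) // prints "3 <nil>"
}
```

Blocking in Push keeps memory bounded: a fast producer cannot queue more data than the parser has accepted, at the cost of tying the caller's progress to the parser's.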
- RecoverOnError(bool): Corresponds to XML_PARSE_RECOVER. (.claude/docs/libxml2-parity.md:125)
- When recovery is enabled and an error occurs, the parser records it in recoverErr, disables further SAX callbacks (disableSAX = true) to prevent inconsistent DOM states, and attempts to resume parsing to find subsequent errors. (.claude/docs/parser-internals.md:87-90)

This diagram maps the natural language concepts of the parser to their specific implementation entities in the Go codebase.
Sources: .claude/docs/parser-internals.md:10-21, .claude/docs/parser-internals.md:49-58, parserctx.go:100-110
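
The recovery behaviour described for RecoverOnError (record the error, silence SAX callbacks, keep scanning) can be modeled with a toy token loop. Everything here is a sketch under assumptions: `recoveringParser` is hypothetical, tokens prefixed with `!` simulate errors, and keeping only the first error in `recoverErr` is a guess rather than documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// recoveringParser is a hypothetical model of the recovery-related state.
type recoveringParser struct {
	recoverOnError bool     // mirrors the RecoverOnError option
	recoverErr     error    // assumption: the first error seen is kept
	disableSAX     bool     // once set, no further events are delivered
	events         []string // stand-in for SAX callbacks
}

// emit delivers an event unless callbacks have been disabled.
func (p *recoveringParser) emit(event string) {
	if !p.disableSAX {
		p.events = append(p.events, event)
	}
}

// parse walks the tokens; a token starting with "!" simulates a parse error.
// In recovery mode it keeps scanning so later errors are also collected.
func (p *recoveringParser) parse(tokens []string) []error {
	var errs []error
	for _, tok := range tokens {
		if !strings.HasPrefix(tok, "!") {
			p.emit(tok)
			continue
		}
		err := errors.New(strings.TrimPrefix(tok, "!"))
		errs = append(errs, err)
		if !p.recoverOnError {
			return errs // without recovery, stop at the first error
		}
		if p.recoverErr == nil {
			p.recoverErr = err
		}
		p.disableSAX = true
	}
	return errs
}

func main() {
	p := &recoveringParser{recoverOnError: true}
	errs := p.parse([]string{"start:a", "!bad char", "start:b", "!second"})
	fmt.Println(len(errs), p.events, p.recoverErr) // prints "2 [start:a] bad char"
}
```

Disabling callbacks after the first error trades completeness for safety: consumers never see events derived from input the parser could not fully validate, while the caller still receives every error found.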