twig

package module
v0.2.0 Latest
Published: Mar 30, 2026 License: MIT Imports: 8 Imported by: 0

README

twig

A Go HTML scraping library with CSS and XPath selectors, fluent traversal, and schema-based extraction.

Installation

go get github.com/your-org/twig

Quick Start

page, err := twig.NewPage(rawHTML, "https://example.com")
if err != nil {
    log.Fatal(err)
}

title, err := page.CSS("h1").Text()
href, err := page.CSS("a.main-link").Href()

Querying

Both Page and Builder support CSS and XPath queries. Queries return a *Builder (single match) or Nodes (multiple matches).

// Single match
b := page.CSS("div.article")
b = page.XPath("//div[@class='article']")

// All matches
ns := page.CSSAll("ul li")
ns = page.XPathAll("//ul/li")

// Scoped to a node
b = page.CSS("div.article").CSS("h2")

XPath note: Use relative expressions (.//...) to scope to the current node. Absolute expressions (//...) always search from the document root.

Builder — Single Node

b.Text()           // inner text (all descendants)
b.HTML()           // outer HTML
b.Attr("data-id")  // attribute value
b.Href()           // resolved href URL
b.Src()            // resolved src URL
b.Tag()            // tag name
b.HasClass("foo")  // bool
b.NodeType()       // html.NodeType

Traversal

b.Parent()
b.Child()       // first element child
b.Child(2)      // nth element child (0-indexed)
b.Next()        // next element sibling
b.Prev()        // previous element sibling
b.Children()    // → Nodes
b.Siblings()    // → Nodes
b.Parents()     // → Nodes
b.NextAll()     // → Nodes
b.PrevAll()     // → Nodes

Nodes — Multiple Nodes

ns.First()         // → *Builder
ns.Last()          // → *Builder
ns.Get(i)          // → *Builder
ns.Len()           // (int, error)
ns.Each(func(i int, b *Builder) { ... })

ns.TextAll()       // []string
ns.AttrAll("href") // []string

// Chain further queries
ns.CSS("span")
ns.XPath(".//span")

// Predicates
ns.Some()                           // any nodes?
ns.None()                           // no nodes?
ns.Any(func(b *Builder) bool { })
ns.Every(func(b *Builder) bool { })

Schema Extraction

Extract structured data from a node or page in one call. Partial results are always returned even when some fields fail.

schema := twig.Schema{
    "title": twig.Field().CSS("h1").Text(),
    "link":  twig.Field().CSS("a").Href(),
    "id":    twig.Field().CSS("article").Attr("data-id"),
}

result, err := page.Extract(schema)
// result["title"], result["link"], result["id"]

Per-item extraction from a list

results, err := page.CSSAll("ul li").ExtractAll(schema)

Handling partial errors

result, err := page.Extract(schema)
if err != nil {
    // errors.As also matches when the error has been wrapped by a caller.
    var ee *twig.ExtractionError
    if errors.As(err, &ee) {
        for field, ferr := range ee.Fields {
            log.Printf("field %s failed: %v", field, ferr)
        }
    }
}
// result still contains successfully extracted fields

Errors

Error                 Meaning
ErrNotFound           Selector matched nothing
ErrInvalidSelector    Bad CSS or XPath expression
ErrParseFailed        HTML parse failure
ErrIndexOutOfBounds   Index out of range
*ExtractionError      Per-field errors from Extract / ExtractAll

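Errors propagate lazily through builder chains: intermediate calls such as CSS, XPath, Child, or Parent return no error, so you only need to check the error returned by the terminal call (for example Text, Attr, or Len).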

Utilities

// Panics on error — useful in tests and scripts
page := twig.Must(twig.NewPage(rawHTML, ""))
text := twig.Must(page.CSS("h1").Text())

Dependencies

Documentation

Constants

This section is empty.

Variables

var (
	ErrNotFound         = errors.New("twig: selector matched no elements")
	ErrInvalidSelector  = errors.New("twig: invalid selector")
	ErrParseFailed      = errors.New("twig: failed to parse HTML")
	ErrIndexOutOfBounds = errors.New("twig: index out of bounds")
)

Functions

func Must

func Must[T any](v T, err error) T

Must panics on error. Useful in scripts and tests.

Types

type Builder

type Builder struct {
	// contains filtered or unexported fields
}

Builder wraps a single node with deferred error propagation.

func (*Builder) Attr

func (b *Builder) Attr(name string) (string, error)

func (*Builder) CSS

func (b *Builder) CSS(selector string) *Builder

CSS searches within this builder's node.

func (*Builder) CSSAll

func (b *Builder) CSSAll(selector string) Nodes

CSSAll searches within this builder's node, returning all matches.

func (*Builder) Child

func (b *Builder) Child(index ...int) *Builder

Child returns the first element child (no args) or the nth element child (one arg). Skips text and comment nodes.

func (*Builder) Children

func (b *Builder) Children() Nodes

func (*Builder) Extract

func (b *Builder) Extract(schema Schema) (Result, error)

func (*Builder) HTML

func (b *Builder) HTML() (string, error)

func (*Builder) HasClass

func (b *Builder) HasClass(name string) (bool, error)

func (*Builder) Href

func (b *Builder) Href() (string, error)

func (*Builder) Next

func (b *Builder) Next() *Builder

func (*Builder) NextAll

func (b *Builder) NextAll() Nodes

func (*Builder) NodeType

func (b *Builder) NodeType() (html.NodeType, error)

func (*Builder) Parent

func (b *Builder) Parent() *Builder

func (*Builder) Parents

func (b *Builder) Parents() Nodes

func (*Builder) Prev

func (b *Builder) Prev() *Builder

func (*Builder) PrevAll

func (b *Builder) PrevAll() Nodes

func (*Builder) Siblings

func (b *Builder) Siblings() Nodes

Siblings returns all sibling elements, excluding self.

func (*Builder) Src

func (b *Builder) Src() (string, error)

func (*Builder) Tag

func (b *Builder) Tag() (string, error)

func (*Builder) Text

func (b *Builder) Text() (string, error)

func (*Builder) XPath

func (b *Builder) XPath(expr string) *Builder

XPath searches within this builder's node. Use relative XPath (.//...) to stay scoped; absolute XPath (//...) searches from document root.

func (*Builder) XPathAll

func (b *Builder) XPathAll(expr string) Nodes

XPathAll searches within this builder's node, returning all matches. Use relative XPath (.//...) to stay scoped; absolute XPath (//...) searches from document root.

type ExtractionError

type ExtractionError struct {
	Fields map[string]error
}

ExtractionError holds per-field errors from Extract or ExtractAll. The partial Result is still returned alongside this error.

func (*ExtractionError) Error

func (e *ExtractionError) Error() string

type ExtractorBuilder

type ExtractorBuilder struct {
	// contains filtered or unexported fields
}

ExtractorBuilder accumulates a selector chain before a string-producing terminal.

func Field

func Field() *ExtractorBuilder

Field starts a chain rooted at the current node.

func (*ExtractorBuilder) Attr

func (eb *ExtractorBuilder) Attr(name string) *StringBuilder

func (*ExtractorBuilder) CSS

func (eb *ExtractorBuilder) CSS(selector string) *ExtractorBuilder

func (*ExtractorBuilder) Child

func (eb *ExtractorBuilder) Child(index ...int) *ExtractorBuilder

func (*ExtractorBuilder) HTML

func (eb *ExtractorBuilder) HTML() *StringBuilder

func (*ExtractorBuilder) Href

func (eb *ExtractorBuilder) Href() *StringBuilder

func (*ExtractorBuilder) Next

func (eb *ExtractorBuilder) Next() *ExtractorBuilder

func (*ExtractorBuilder) Parent

func (eb *ExtractorBuilder) Parent() *ExtractorBuilder

func (*ExtractorBuilder) Prev

func (eb *ExtractorBuilder) Prev() *ExtractorBuilder

func (*ExtractorBuilder) Src

func (eb *ExtractorBuilder) Src() *StringBuilder

func (*ExtractorBuilder) Text

func (eb *ExtractorBuilder) Text() *StringBuilder

func (*ExtractorBuilder) XPath

func (eb *ExtractorBuilder) XPath(expr string) *ExtractorBuilder

type Nodes

type Nodes struct {
	// contains filtered or unexported fields
}

Nodes wraps a collection of nodes with deferred error propagation.

func (Nodes) Any

func (ns Nodes) Any(fn func(*Builder) bool) (bool, error)

func (Nodes) AttrAll

func (ns Nodes) AttrAll(name string) ([]string, error)

func (Nodes) CSS

func (ns Nodes) CSS(selector string) Nodes

func (Nodes) Each

func (ns Nodes) Each(fn func(i int, b *Builder))

func (Nodes) Every

func (ns Nodes) Every(fn func(*Builder) bool) (bool, error)

func (Nodes) ExtractAll

func (ns Nodes) ExtractAll(schema Schema) ([]Result, error)

func (Nodes) Filter added in v0.2.0

func (ns Nodes) Filter(fn func(*Builder) bool) Nodes

Filter returns a new Nodes containing only elements for which fn returns true.
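
As a rough illustration of the "returns a new collection" contract, here is a generic slice filter. It is a sketch over plain slices, not over Nodes: the filter function is hypothetical, and it ignores the deferred error that a real Nodes value carries.

```go
package main

import "fmt"

// filter returns a new slice holding only the items for which keep returns
// true. The input slice is left unchanged.
func filter[T any](items []T, keep func(T) bool) []T {
	out := make([]T, 0, len(items))
	for _, it := range items {
		if keep(it) {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	words := []string{"go", "html", "css", "xpath"}
	short := filter(words, func(s string) bool { return len(s) <= 3 })
	fmt.Println(short, len(words)) // [go css] 4
}
```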

func (Nodes) First

func (ns Nodes) First() *Builder

func (Nodes) Get

func (ns Nodes) Get(index int) *Builder

func (Nodes) Last

func (ns Nodes) Last() *Builder

func (Nodes) Len

func (ns Nodes) Len() (int, error)

func (Nodes) None

func (ns Nodes) None() (bool, error)

func (Nodes) Some

func (ns Nodes) Some() (bool, error)

func (Nodes) TextAll

func (ns Nodes) TextAll() ([]string, error)

func (Nodes) XPath

func (ns Nodes) XPath(expr string) Nodes

type Page

type Page struct {
	// contains filtered or unexported fields
}

Page is the document entry point.

func NewPage

func NewPage(rawHTML string, baseURL string) (*Page, error)

NewPage parses an HTML string. baseURL may be empty.
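
The README describes Href and Src as returning resolved URLs, which suggests baseURL is what relative links are resolved against. The sketch below shows standard reference resolution with net/url. The resolve helper is hypothetical, and returning the reference unchanged when the base is empty is an assumption, not documented twig behaviour.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve returns ref resolved against base. With an empty base it returns
// ref as parsed (an assumed fallback for this sketch).
func resolve(base, ref string) (string, error) {
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	if base == "" {
		return r.String(), nil
	}
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	for _, ref := range []string{"../about", "img/a.png", "https://other.org/p"} {
		abs, err := resolve("https://example.com/blog/", ref)
		fmt.Println(abs, err)
	}
	// https://example.com/about <nil>
	// https://example.com/blog/img/a.png <nil>
	// https://other.org/p <nil>
}
```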

func (*Page) Body

func (p *Page) Body() *Builder

func (*Page) CSS

func (p *Page) CSS(selector string) *Builder

func (*Page) CSSAll

func (p *Page) CSSAll(selector string) Nodes

func (*Page) Extract

func (p *Page) Extract(schema Schema) (Result, error)

func (*Page) Head

func (p *Page) Head() *Builder

func (*Page) XPath

func (p *Page) XPath(expr string) *Builder

func (*Page) XPathAll

func (p *Page) XPathAll(expr string) Nodes

type Result

type Result map[string]string

Result is the output of Extract.

type Schema

type Schema map[string]*StringBuilder

Schema maps keys to StringBuilders evaluated per-node at extract time.

type StringBuilder added in v0.2.0

type StringBuilder struct {
	// contains filtered or unexported fields
}

StringBuilder continues the chain after a string-producing step. Transforms are applied in order after extraction.

func (*StringBuilder) Map added in v0.2.0

func (sb *StringBuilder) Map(fn func(string) string) *StringBuilder

func (*StringBuilder) Replace added in v0.2.0

func (sb *StringBuilder) Replace(old, new string) *StringBuilder

func (*StringBuilder) TrimSpace added in v0.2.0

func (sb *StringBuilder) TrimSpace() *StringBuilder
