scraper

Version: v0.1.29 (latest) | Published: Apr 4, 2026 | License: MIT | Imports: 16 | Imported by: 0

Repository

github.com/unluckythoughts/go-scraper

README

go-scraper

A simple, powerful HTML scraper library for Go built on top of Colly and goquery.

Features

🚀 Simple API - Easy-to-use scraper with sensible defaults
🔄 Smart Pagination - Sequential and parallel pagination support
📡 Channel-based Streaming - Memory-efficient result streaming
🔁 Automatic Retries - Exponential backoff with jitter for rate limits (429)
🎯 CSS Selectors - Powerful CSS selector support with attribute extraction
🛠️ Utility Functions - Built-in helpers for text, attributes, integers, and floats
⚙️ Configurable - Custom user agents, domains, and retry settings

Installation

go get github.com/unluckythoughts/go-scraper

Quick Start

package main

import (
    "fmt"

    scraper "github.com/unluckythoughts/go-scraper"
)

func main() {
    // Create a scraper with default settings
    s := scraper.NewDefault()
    
    // Scrape HTML from a URL
    html, err := s.ScrapeHTML("https://example.com")
    if err != nil {
        panic(err)
    }
    
    fmt.Println(html)
}

Core Functions

1. ScrapeHTML - Fetch Complete HTML

Fetches the complete HTML content from a URL with automatic retry on rate limiting.

s := scraper.NewDefault()
html, err := s.ScrapeHTML("https://example.com")

Features:

Automatic exponential backoff retry for 429 (Too Many Requests)
Random jitter (0-1s) to prevent thundering herd
Up to 5 retry attempts

2. ScrapeOuterHTML - Extract Elements

Extracts outer HTML of elements matching a CSS selector.

elements, err := s.ScrapeOuterHTML("https://example.com", "div.product")
// Returns: []string containing outer HTML of all matching elements

3. ScrapePaginated - Multi-page Scraping

Scrapes content across multiple pages with sequential or parallel pagination.

// Sequential pagination (follows "next" links)
config := scraper.PaginationConfig{
    NextPageSelector: "a.next[href]",
}
resultsChan, err := s.ScrapePaginated("https://example.com", "div.item", config)
if err != nil {
    log.Fatalf("Failed to start pagination: %v", err)
}

// Process results from channel
for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    fmt.Println(result.Data)
}

Parallel Pagination:

// Parallel pagination (scrapes all pages simultaneously)
config := scraper.PaginationConfig{
    LastPageSelector:   "span.page-count", // Element containing total pages
    NextPageURLPattern: "/page/::page::/",  // URL pattern with ::page:: placeholder
}
resultsChan, err := s.ScrapePaginated("https://example.com", "div.item", config)
if err != nil {
    log.Fatalf("Failed to start pagination: %v", err)
}
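
To make the `::page::` placeholder concrete, here is a minimal standard-library sketch of how such a pattern could be expanded into page URLs. `pageURLs` is a hypothetical helper, and resolving the pattern against the start URL is an assumption; the README does not say how the library combines the two.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// pageURLs expands pattern for pages first..last and resolves each result
// against startURL. Hypothetical helper; the library may build URLs differently.
func pageURLs(startURL, pattern string, first, last int) ([]string, error) {
	base, err := url.Parse(startURL)
	if err != nil {
		return nil, err
	}
	var urls []string
	for page := first; page <= last; page++ {
		// Substitute the page number for the ::page:: placeholder.
		ref, err := url.Parse(strings.ReplaceAll(pattern, "::page::", strconv.Itoa(page)))
		if err != nil {
			return nil, err
		}
		urls = append(urls, base.ResolveReference(ref).String())
	}
	return urls, nil
}

func main() {
	urls, err := pageURLs("https://example.com", "/page/::page::/", 2, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Because every URL is known up front once the last page number has been read, the pages can be fetched independently, which is what makes the parallel mode possible.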

Utility Functions

The library includes utility functions for extracting and parsing data from HTML.

GetOuterHTML

html := "<div><p>Hello</p></div>"
results, _ := scraper.GetOuterHTML(html, "p")
// Returns: ["<p>Hello</p>"]

GetText

// Extract text content
texts, _ := scraper.GetText(html, "p")

// Extract attribute value using selector
links, _ := scraper.GetText(html, "a[href]")

GetTextSingle

// Extract first matching element's text
text, _ := scraper.GetTextSingle(html, "h1")

// Extract first matching element's attribute
link, _ := scraper.GetTextSingle(html, "a[href]")

GetInt & GetFloat

// Extract and parse as integer
quantity, _ := scraper.GetInt(html, "span.quantity")

// Extract and parse as float (handles currency symbols)
price, _ := scraper.GetFloat(html, "span.price") // Cleans: $99.99 -> 99.99

// From attributes
value, _ := scraper.GetInt(html, "input[data-value]")
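
The currency cleaning mentioned above (`$99.99 -> 99.99`) can be approximated with a short standard-library sketch. `cleanNumber` and `parsePrice` are hypothetical names, and the exact cleaning rules the library applies are not documented here; this version simply keeps digits, the decimal point, and the minus sign.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cleanNumber keeps only digits, '.', and '-' from text.
// Assumes '.' is the decimal separator; hypothetical helper.
func cleanNumber(text string) string {
	var b strings.Builder
	for _, r := range text {
		if (r >= '0' && r <= '9') || r == '.' || r == '-' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// parsePrice strips non-numeric characters and parses the rest as float64.
func parsePrice(text string) (float64, error) {
	return strconv.ParseFloat(cleanNumber(text), 64)
}

func main() {
	price, err := parsePrice("$99.99")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f\n", price) // prints 99.99
}
```

Unlike the `_` placeholders in the snippets above, checking the returned error distinguishes a genuine zero value from a failed conversion.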

GetAttrName

// Extract attribute name from selector
attr := scraper.GetAttrName("div[data-id]") // Returns: "data-id"

GetFullURL

// Convert relative URL to absolute
fullURL := scraper.GetFullURL("https://example.com/page", "../other")
// Returns: "https://example.com/other"
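
The same relative-to-absolute resolution can be reproduced with the standard library's `net/url` package. This is a sketch of the general technique, not the library's code; `resolveURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves relativePath against baseURL following RFC 3986.
// Hypothetical helper built on net/url.
func resolveURL(baseURL, relativePath string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(relativePath)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	full, err := resolveURL("https://example.com/page", "../other")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(full) // prints https://example.com/other
}
```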

Configuration

Custom Scraper Options

opts := scraper.Options{
    UserAgent:           "MyBot/1.0",
    AllowedDomains:      []string{"example.com"},
    MaxDepth:            3,
    Async:               false,
    MaxRetries:          5,
    MaxParallelRequests: 4,
    ForceRod:            false, // Set to true for sites with aggressive bot detection
}
s := scraper.New(opts)

Option Details:

UserAgent - Custom user agent string
AllowedDomains - Restrict scraping to specific domains
MaxDepth - Maximum depth for link following
Async - Enable asynchronous scraping
MaxRetries - Maximum retry attempts for failed requests
MaxParallelRequests - Number of parallel requests (default: 4)
UseCloudflareBypass - Enable TLS/header bypass to avoid triggering challenges (recommended)
ForceRod - Force browser automation for all requests (bypasses Cloudflare, CAPTCHA)

Bot Detection and Cloudflare Bypass

The scraper provides two mechanisms for getting past Cloudflare protection, which can be used separately or together:

1. Cloudflare Bypass (Preventive) - `UseCloudflareBypass`

Uses cloudflare-bp-go to configure proper TLS settings and headers to avoid triggering Cloudflare challenges in the first place.

opts := scraper.Options{
    UseCloudflareBypass: true, // Recommended for Cloudflare-protected sites
    MaxRetries:          5,
}
s := scraper.New(opts)

How it works:

Sets proper TLS configuration (curves, ciphers)
Adds validated HTTP headers (Accept, User-Agent, etc.)
Makes requests look more like a real browser
Fast - no browser needed
Prevents challenges from appearing

2. Rod Browser (Reactive) - `ForceRod`

Uses rod browser automation to solve challenges that are already displayed.

opts := scraper.Options{
    ForceRod:   true, // Launches real browser for each request
    MaxRetries: 3,
}
s := scraper.New(opts)

How it works:

Launches a real Chromium browser
Waits for Cloudflare challenges to auto-solve
Extracts cookies and page content
Slow - real browser overhead
Solves existing challenges

3. Combined Approach (Recommended for Aggressive Protection)

Use both for maximum success rate:

opts := scraper.Options{
    UseCloudflareBypass: true, // First: try to avoid challenges
    ForceRod:            true, // Fallback: solve challenges if they appear
    MaxRetries:          3,
}
s := scraper.New(opts)

When to use what:

Normal sites: No options needed
Light Cloudflare: UseCloudflareBypass: true
Aggressive Cloudflare: UseCloudflareBypass: true + ForceRod: true
Always blocked: ForceRod: true

Automatic fallback: Even without ForceRod, the scraper automatically detects bot challenges and attempts rod-based bypass. ForceRod just skips the initial attempt and goes straight to the browser.

Pagination Configuration

config := scraper.PaginationConfig{
    // For sequential pagination
    NextPageSelector: "a.next[href]", // CSS selector for next page link
    
    // For parallel pagination
    LastPageSelector:   "span.total-pages",  // Element with total page count
    NextPageURLPattern: "/products?page=::page::", // URL pattern
}

CSS Selector Features

The library supports advanced CSS selectors including attribute selectors:

// Basic selectors
"div.product"              // Class selector
"#main"                    // ID selector
"div > p"                  // Direct child
"div p"                    // Descendant

// Attribute selectors (auto-extracts attribute value)
"a[href]"                  // Extract href attribute
"img[src]"                 // Extract src attribute
"input[data-value]"        // Extract data-value attribute
"div[class*='active']"     // Attribute contains value

Error Handling

All functions return errors that can be checked:

html, err := s.ScrapeHTML(url)
if err != nil {
    // Handle error
    log.Printf("Failed to scrape: %v", err)
}

For paginated scraping, errors are sent through the channel:

for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    // Process result.Data
}
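
The channel pattern above can be tried without any network access. In the sketch below, `Result` mirrors the shape of the library's documented type, while `streamPages` and `collect` are hypothetical stand-ins for a producer and a consumer; the real producer is ScrapePaginated.

```go
package main

import (
	"errors"
	"fmt"
)

// Result mirrors the shape of the library's Result type.
type Result struct {
	Data string
	Err  error
}

// streamPages is a stand-in producer: it sends one Result per page and
// closes the channel when finished. An empty page simulates a failure.
func streamPages(pages []string) <-chan Result {
	out := make(chan Result)
	go func() {
		defer close(out) // closing ends the consumer's range loop
		for _, p := range pages {
			if p == "" {
				out <- Result{Err: errors.New("empty page")}
				continue
			}
			out <- Result{Data: p}
		}
	}()
	return out
}

// collect drains the channel, separating data from errors.
func collect(results <-chan Result) (data []string, errs []error) {
	for r := range results {
		if r.Err != nil {
			errs = append(errs, r.Err)
			continue
		}
		data = append(data, r.Data)
	}
	return data, errs
}

func main() {
	data, errs := collect(streamPages([]string{"<p>a</p>", "", "<p>b</p>"}))
	fmt.Printf("%d results, %d errors\n", len(data), len(errs)) // prints 2 results, 1 errors
}
```

Carrying the error inside each Result lets one failed page be reported without stopping the stream for the remaining pages.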

Advanced Examples

Extract Product Data

s := scraper.NewDefault()
html, _ := s.ScrapeHTML("https://shop.example.com/product/123")

// Extract product details
name, _ := scraper.GetTextSingle(html, "h1.product-name")
price, _ := scraper.GetFloat(html, "span.price")
stock, _ := scraper.GetInt(html, "span.stock[data-quantity]")
imageURL, _ := scraper.GetTextSingle(html, "img.product-image[src]")

fmt.Printf("Product: %s, Price: $%.2f, Stock: %d, Image: %s\n", name, price, stock, imageURL)

Scrape Multiple Pages

config := scraper.PaginationConfig{
    NextPageSelector: "a.pagination-next[href]",
}

resultsChan, err := s.ScrapePaginated(
    "https://blog.example.com",
    "article.post",
    config,
)

if err != nil {
    panic(err)
}

posts := []string{}
for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    posts = append(posts, result.Data)
}

fmt.Printf("Scraped %d posts\n", len(posts))

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Built with:

Colly - Fast web scraping framework
goquery - jQuery-like HTML parsing

Documentation

Index

func GetAttrName(selector string) string
func GetBaseURL(fullURL string) string
func GetCurrentURL(fullURL string) string
func GetFloat(htmlText, selector string) (float64, error)
func GetFullURL(baseURL, relativePath string) string
func GetInt(htmlText, selector string) (int, error)
func GetOuterHTML(htmlText, selector string) ([]string, error)
func GetText(htmlText, selector string) ([]string, error)
func GetTextSingle(htmlText, selector string) (string, error)
func GetTime(htmlText, selector, format string) (*time.Time, error)
type ExtractionFunc
type Options
type PaginationConfig
type Result
type Scraper
- func New(opts Options) *Scraper
- func NewDefault() *Scraper

Functions

func GetAttrName

func GetAttrName(selector string) string

GetAttrName extracts the attribute name from a CSS selector that ends with an attribute selector. It returns the attribute name if the selector ends with an attribute selector, and an empty string otherwise. Examples: "div[data-id]" -> "data-id", "input[type='text']" -> "type", "a[href]" -> "href".
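
The behaviour described for GetAttrName can be approximated with plain string handling. The sketch below uses a hypothetical `attrName` function and is not the library's source; it only reproduces the documented examples and does not handle attribute values that themselves contain brackets.

```go
package main

import (
	"fmt"
	"strings"
)

// attrName returns the attribute named by a trailing [attr] or [attr=value]
// selector, or "" when the selector does not end with an attribute selector.
// Hypothetical helper for illustration.
func attrName(selector string) string {
	selector = strings.TrimSpace(selector)
	if !strings.HasSuffix(selector, "]") {
		return ""
	}
	open := strings.LastIndex(selector, "[")
	if open < 0 {
		return ""
	}
	inner := selector[open+1 : len(selector)-1]
	// Cut at the first operator character (=, ~=, |=, ^=, $=, *=).
	if i := strings.IndexAny(inner, "=~|^$*"); i >= 0 {
		inner = inner[:i]
	}
	return strings.TrimSpace(inner)
}

func main() {
	for _, sel := range []string{"div[data-id]", "input[type='text']", "a[href]", "div.product"} {
		fmt.Printf("%q -> %q\n", sel, attrName(sel))
	}
}
```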

func GetBaseURL

func GetBaseURL(fullURL string) string

func GetCurrentURL (added in v0.1.18)

func GetCurrentURL(fullURL string) string

GetCurrentURL extracts just the path from a full URL, removing the query parameters and fragment.
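
Stripping the query string and fragment from a URL can be done with `net/url`, as sketched below. `stripQueryAndFragment` is a hypothetical helper, and keeping the scheme and host in the result is an assumption; the description above does not say whether GetCurrentURL returns them.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripQueryAndFragment removes the query string and fragment from fullURL,
// keeping scheme, host, and path. Hypothetical helper; whether the library
// keeps the scheme and host is an assumption.
func stripQueryAndFragment(fullURL string) (string, error) {
	u, err := url.Parse(fullURL)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

func main() {
	clean, err := stripQueryAndFragment("https://example.com/products/list?page=2&sort=asc#reviews")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(clean) // prints https://example.com/products/list
}
```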

func GetFloat

func GetFloat(htmlText, selector string) (float64, error)

GetFloat extracts text from the first element matching the selector and converts it to float64. It returns 0.0 if no match is found or the conversion fails.

func GetFullURL

func GetFullURL(baseURL, relativePath string) string

func GetInt

func GetInt(htmlText, selector string) (int, error)

GetInt extracts text from the first element matching the selector and converts it to int. It returns 0 if no match is found or the conversion fails.

func GetOuterHTML

func GetOuterHTML(htmlText, selector string) ([]string, error)

GetOuterHTML extracts the outer HTML of elements matching the given CSS selector from HTML text. It returns a slice of outer HTML strings for all matching elements.

func GetText

func GetText(htmlText, selector string) ([]string, error)

GetText extracts the text content of elements matching the given CSS selector from HTML text. It returns a slice of text strings for all matching elements.

func GetTextSingle

func GetTextSingle(htmlText, selector string) (string, error)

GetTextSingle extracts the text content of the first element matching the given CSS selector. It returns an empty string if no match is found.

func GetTime

func GetTime(htmlText, selector, format string) (*time.Time, error)

GetTime extracts text from the first element matching the selector and returns it as a string. This function can be extended to parse dates into specific formats if needed.

Types

type ExtractionFunc

type ExtractionFunc func(i int, s *goquery.Selection)

type Options

type Options struct {
	// UserAgent to use for requests
	UserAgent string
	// AllowedDomains restricts scraping to specific domains
	AllowedDomains []string
	// MaxDepth limits how deep the scraper will follow links
	MaxDepth int
	// Async enables asynchronous scraping
	Async bool
	// MaxParallelRequests sets the maximum number of parallel requests
	MaxParallelRequests int
	// MaxRetries specifies the maximum number of retries for requests
	MaxRetries int
	// UseCloudflareBypass enables Cloudflare bypass using proper TLS and headers
	// Helps avoid triggering Cloudflare challenges in the first place
	UseCloudflareBypass bool
	// Logger allows custom logging in debug (optional)
	Logger *zap.Logger
}

Options provides configuration for the Scraper

type PaginationConfig

type PaginationConfig struct {
	// NextPageSelector is the CSS selector for the "next page" link
	// if the selector matches no elements, pagination stops
	NextPageSelector string
	// LastPageSelector is the CSS selector for the element that holds the last page number.
	// Pages are requested with incrementing page numbers up to that value,
	// using NextPageURLPattern to construct each URL.
	LastPageSelector string
	// NextPageURLPattern is an optional pattern to construct the next page URL by
	// replacing a '::page::' with the page number.
	// This is mandatory if LastPageSelector is used
	NextPageURLPattern string
}

PaginationConfig holds configuration for paginated scraping

type Result

type Result struct {
	Data string
	Err  error
}

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

Scraper represents an HTML scraper with configurable options

func New

func New(opts Options) *Scraper

New creates a new Scraper instance with the given options

func NewDefault

func NewDefault() *Scraper

NewDefault creates a new Scraper instance with default options

func (*Scraper) ScrapeHTML

func (s *Scraper) ScrapeHTML(url string) (string, error)

ScrapeHTML fetches and returns the complete HTML content for a given URL. It implements exponential backoff retry for 429 (Too Many Requests) status codes, detects bot challenges, and uses rod to solve CAPTCHAs and obtain cookies.

func (*Scraper) ScrapeOuterHTML

func (s *Scraper) ScrapeOuterHTML(url, selector string) ([]string, error)

ScrapeOuterHTML fetches the outer HTML of elements matching the given CSS selector

func (*Scraper) ScrapePaginated

func (s *Scraper) ScrapePaginated(url, selector string, config PaginationConfig) (<-chan Result, error)

ScrapePaginated scrapes the outer HTML of elements matching the selector across multiple pages. It returns a receive-only channel that streams results as they are scraped, plus an error; errors for individual results are delivered through the channel in Result.Err.

Directories

Path	Synopsis
examples