scraper

package module
v0.1.29 Latest
Published: Apr 4, 2026 License: MIT Imports: 16 Imported by: 0

README

go-scrapper

A simple, powerful HTML scraper library for Go built on top of Colly and goquery.

Features

  • 🚀 Simple API - Easy-to-use scraper with sensible defaults
  • 🔄 Smart Pagination - Sequential and parallel pagination support
  • 📡 Channel-based Streaming - Memory-efficient result streaming
  • 🔁 Automatic Retries - Exponential backoff with jitter for rate limits (429)
  • 🎯 CSS Selectors - Powerful CSS selector support with attribute extraction
  • 🛠️ Utility Functions - Built-in helpers for text, attributes, integers, and floats
  • ⚙️ Configurable - Custom user agents, domains, and retry settings

Installation

go get github.com/unluckythoughts/go-scrapper

Quick Start

package main

import (
    "fmt"

    // The import path ends in "go-scrapper", but the package name is "scraper"
    scraper "github.com/unluckythoughts/go-scrapper"
)

func main() {
    // Create a scraper with default settings
    s := scraper.NewDefault()
    
    // Scrape HTML from a URL
    html, err := s.ScrapeHTML("https://example.com")
    if err != nil {
        panic(err)
    }
    
    fmt.Println(html)
}

Core Functions

1. ScrapeHTML - Fetch Complete HTML

Fetches the complete HTML content from a URL with automatic retry on rate limiting.

s := scraper.NewDefault()
html, err := s.ScrapeHTML("https://example.com")

Features:

  • Automatic exponential backoff retry for 429 (Too Many Requests)
  • Random jitter (0-1s) to prevent thundering herd
  • Up to 5 retry attempts

2. ScrapeOuterHTML - Extract Elements

Extracts outer HTML of elements matching a CSS selector.

elements, err := s.ScrapeOuterHTML("https://example.com", "div.product")
// Returns: []string containing outer HTML of all matching elements

3. ScrapePaginated - Multi-page Scraping

Scrapes content across multiple pages with sequential or parallel pagination.

// Sequential pagination (follows "next" links)
config := scraper.PaginationConfig{
    NextPageSelector: "a.next[href]",
}
resultsChan, err := s.ScrapePaginated("https://example.com", "div.item", config)
if err != nil {
    log.Fatalf("Failed to start pagination: %v", err)
}

// Process results from channel
for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    fmt.Println(result.Data)
}

Parallel Pagination:

// Parallel pagination (scrapes all pages simultaneously)
config := scraper.PaginationConfig{
    LastPageSelector:   "span.page-count", // Element containing total pages
    NextPageURLPattern: "/page/::page::/",  // URL pattern with ::page:: placeholder
}
resultsChan, err := s.ScrapePaginated("https://example.com", "div.item", config)
if err != nil {
    log.Fatalf("Failed to start pagination: %v", err)
}
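
To make the ::page:: placeholder concrete, here is a small sketch of how a pattern could be expanded into one URL per page. The `pageURLs` helper is hypothetical, and starting the numbering at page 1 is an assumption; the library may build its URLs differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageURLs expands a pattern containing the ::page:: placeholder into one
// URL per page, from 1 to lastPage. Starting at page 1 is an assumption.
func pageURLs(base, pattern string, lastPage int) []string {
	if lastPage < 1 {
		return nil
	}
	urls := make([]string, 0, lastPage)
	for p := 1; p <= lastPage; p++ {
		path := strings.ReplaceAll(pattern, "::page::", strconv.Itoa(p))
		urls = append(urls, strings.TrimRight(base, "/")+path)
	}
	return urls
}

func main() {
	for _, u := range pageURLs("https://example.com", "/page/::page::/", 3) {
		fmt.Println(u)
	}
}
```

Because every URL is known up front once the last page number has been read, the pages can be fetched concurrently instead of one after another.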

Utility Functions

The library includes utility functions for extracting and parsing data from HTML.

GetOuterHTML

html := "<div><p>Hello</p></div>"
results, _ := scraper.GetOuterHTML(html, "p")
// Returns: ["<p>Hello</p>"]

GetText

// Extract text content
texts, _ := scraper.GetText(html, "p")

// Extract attribute values: a trailing [attr] in the selector returns that attribute instead of the text
links, _ := scraper.GetText(html, "a[href]")

GetTextSingle

// Extract first matching element's text
text, _ := scraper.GetTextSingle(html, "h1")

// Extract first matching element's attribute
link, _ := scraper.GetTextSingle(html, "a[href]")

GetInt & GetFloat

// Extract and parse as integer
quantity, _ := scraper.GetInt(html, "span.quantity")

// Extract and parse as float (handles currency symbols)
price, _ := scraper.GetFloat(html, "span.price") // Cleans: $99.99 -> 99.99

// From attributes
value, _ := scraper.GetInt(html, "input[data-value]")
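
The "$99.99 -> 99.99" cleaning step can be approximated with the standard library alone. The sketch below is an assumption about what such cleaning might look like; `parsePrice` is a hypothetical helper, and GetFloat's actual rules (for example, how it treats thousands separators) are not documented here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice keeps digits, '.', and a leading '-', drops everything else
// (currency symbols, commas, letters), and parses the remainder as float64.
func parsePrice(text string) (float64, error) {
	var b strings.Builder
	for _, r := range strings.TrimSpace(text) {
		switch {
		case r >= '0' && r <= '9', r == '.':
			b.WriteRune(r)
		case r == '-' && b.Len() == 0:
			b.WriteRune(r)
		}
	}
	return strconv.ParseFloat(b.String(), 64)
}

func main() {
	for _, s := range []string{"$99.99", "1,299.50 USD", "n/a"} {
		v, err := parsePrice(s)
		fmt.Printf("%q -> %v (err: %v)\n", s, v, err)
	}
}
```

Returning the parse error, rather than silently yielding zero, lets the caller tell a real price of 0 apart from text that contains no number.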

GetAttrName

// Extract attribute name from selector
attr := scraper.GetAttrName("div[data-id]") // Returns: "data-id"

GetFullURL

// Convert relative URL to absolute
fullURL := scraper.GetFullURL("https://example.com/page", "../other")
// Returns: "https://example.com/other"
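
The same relative-to-absolute conversion can be reproduced with Go's standard net/url package, which follows RFC 3986 reference resolution. This is a sketch of the technique only; `resolveURL` is a hypothetical helper, and GetFullURL may differ in edge cases.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL resolves ref against baseURL using RFC 3986 rules.
func resolveURL(baseURL, ref string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	rel, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(rel).String(), nil
}

func main() {
	full, err := resolveURL("https://example.com/page", "../other")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(full) // https://example.com/other
}
```

An absolute reference is returned unchanged, so the helper is safe to call on links that are already complete URLs.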

Configuration

Custom Scraper Options

opts := scraper.Options{
    UserAgent:           "MyBot/1.0",
    AllowedDomains:      []string{"example.com"},
    MaxDepth:            3,
    Async:               false,
    MaxRetries:          5,
    MaxParallelRequests: 4,
    ForceRod:            false, // Set to true for sites with aggressive bot detection
}
s := scraper.New(opts)

Option Details:

  • UserAgent - Custom user agent string
  • AllowedDomains - Restrict scraping to specific domains
  • MaxDepth - Maximum depth for link following
  • Async - Enable asynchronous scraping
  • MaxRetries - Maximum retry attempts for failed requests
  • MaxParallelRequests - Number of parallel requests (default: 4)
  • UseCloudflareBypass - Enable TLS/header bypass to avoid triggering challenges (recommended)
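  • ForceRod - Force browser automation for all requests (bypasses Cloudflare, CAPTCHA); this field is not listed in the Options struct documented below for this version
  • Logger - Optional *zap.Logger for debug logging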

Bot Detection and Cloudflare Bypass

The scraper provides two layers for getting past Cloudflare protection:

1. Cloudflare Bypass (Preventive) - UseCloudflareBypass

Uses cloudflare-bp-go to configure proper TLS settings and headers to avoid triggering Cloudflare challenges in the first place.

opts := scraper.Options{
    UseCloudflareBypass: true, // Recommended for Cloudflare-protected sites
    MaxRetries:          5,
}
s := scraper.New(opts)

How it works:

  • Sets proper TLS configuration (curves, ciphers)
  • Adds validated HTTP headers (Accept, User-Agent, etc.)
  • Makes requests look more like a real browser
  • Fast - no browser needed
  • Prevents challenges from appearing

2. Rod Browser (Reactive) - ForceRod

Uses rod browser automation to solve challenges that are already displayed.

opts := scraper.Options{
    ForceRod:   true, // Launches real browser for each request
    MaxRetries: 3,
}
s := scraper.New(opts)

How it works:

  • Launches a real Chromium browser
  • Waits for Cloudflare challenges to auto-solve
  • Extracts cookies and page content
  • Slow - real browser overhead
  • Solves existing challenges

Use both for maximum success rate:

opts := scraper.Options{
    UseCloudflareBypass: true, // Reduces the chance of triggering a challenge
    ForceRod:            true, // Uses the browser for every request to solve any challenge that appears
    MaxRetries:          3,
}
s := scraper.New(opts)

When to use what:

  • Normal sites: No options needed
  • Light Cloudflare: UseCloudflareBypass: true
  • Aggressive Cloudflare: UseCloudflareBypass: true + ForceRod: true
  • Always blocked: ForceRod: true

Automatic fallback: Even without ForceRod, the scraper automatically detects bot challenges and attempts rod-based bypass. ForceRod just skips the initial attempt and goes straight to the browser.

Pagination Configuration

config := scraper.PaginationConfig{
    // For sequential pagination
    NextPageSelector: "a.next[href]", // CSS selector for next page link
    
    // For parallel pagination
    LastPageSelector:   "span.total-pages",  // Element with total page count
    NextPageURLPattern: "/products?page=::page::", // URL pattern
}

CSS Selector Features

The library supports advanced CSS selectors including attribute selectors:

// Basic selectors
"div.product"              // Class selector
"#main"                    // ID selector
"div > p"                  // Direct child
"div p"                    // Descendant

// Attribute selectors (auto-extracts attribute value)
"a[href]"                  // Extract href attribute
"img[src]"                 // Extract src attribute
"input[data-value]"        // Extract data-value attribute
"div[class*='active']"     // Attribute contains value

Error Handling

Most functions return an error that should be checked (GetAttrName, GetBaseURL, GetCurrentURL, and GetFullURL return only a string):

html, err := s.ScrapeHTML(url)
if err != nil {
    // Handle error
    log.Printf("Failed to scrape: %v", err)
}

For paginated scraping, errors are sent through the channel:

for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    // Process result.Data
}
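
The channel pattern above can be tried without any network access. The sketch below mirrors the shape of the library's Result type with a local definition; `collect` and `sample` are hypothetical helpers added for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Result mirrors the shape of the library's Result type (Data plus Err).
type Result struct {
	Data string
	Err  error
}

// collect drains the channel, separating data from errors.
func collect(results <-chan Result) (data []string, errs []error) {
	for r := range results {
		if r.Err != nil {
			errs = append(errs, r.Err)
			continue
		}
		data = append(data, r.Data)
	}
	return data, errs
}

// sample simulates a paginated scrape in which one page fails.
func sample() <-chan Result {
	ch := make(chan Result)
	go func() {
		defer close(ch) // closing the channel ends the range loop in collect
		ch <- Result{Data: "<div>one</div>"}
		ch <- Result{Err: errors.New("page 2: status 500")}
		ch <- Result{Data: "<div>three</div>"}
	}()
	return ch
}

func main() {
	data, errs := collect(sample())
	fmt.Printf("%d items, %d errors\n", len(data), len(errs))
}
```

Carrying the error inside each Result means one failed page does not stop the stream; the consumer decides whether to skip it or abort.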

Advanced Examples

Extract Product Data

s := scraper.NewDefault()
html, err := s.ScrapeHTML("https://shop.example.com/product/123")
if err != nil {
    log.Fatalf("Failed to scrape: %v", err)
}

// Extract product details (extraction errors are ignored here for brevity)
name, _ := scraper.GetTextSingle(html, "h1.product-name")
price, _ := scraper.GetFloat(html, "span.price")
stock, _ := scraper.GetInt(html, "span.stock[data-quantity]")
imageURL, _ := scraper.GetTextSingle(html, "img.product-image[src]")

fmt.Printf("Product: %s, Price: $%.2f, Stock: %d, Image: %s\n", name, price, stock, imageURL)

Scrape Multiple Pages

config := scraper.PaginationConfig{
    NextPageSelector: "a.pagination-next[href]",
}

resultsChan, err := s.ScrapePaginated(
    "https://blog.example.com",
    "article.post",
    config,
)

if err != nil {
    panic(err)
}

posts := []string{}
for result := range resultsChan {
    if result.Err != nil {
        log.Printf("Error: %v", result.Err)
        continue
    }
    posts = append(posts, result.Data)
}

fmt.Printf("Scraped %d posts\n", len(posts))

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Built with:

  • Colly - Fast web scraping framework
  • goquery - jQuery-like HTML parsing

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetAttrName

func GetAttrName(selector string) string

GetAttrName extracts the attribute name from a CSS selector that ends with an attribute selector. It returns the attribute name if the selector ends with an attribute selector, and an empty string otherwise. Examples: "div[data-id]" -> "data-id", "input[type='text']" -> "type", "a[href]" -> "href".
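
One way to get the behaviour described in these examples is a regular expression anchored at the end of the selector. This is a sketch under that assumption; `attrName` and `trailingAttr` are hypothetical names, and the library's own parsing may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// trailingAttr matches a selector that ends in [name] or [name<op>value]
// and captures the attribute name.
var trailingAttr = regexp.MustCompile(`\[([A-Za-z_][A-Za-z0-9_-]*)[^\]]*\]$`)

// attrName returns the attribute named by a trailing attribute selector,
// or an empty string when the selector does not end with one.
func attrName(selector string) string {
	m := trailingAttr.FindStringSubmatch(selector)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	for _, sel := range []string{"div[data-id]", "input[type='text']", "a[href]", "div.product"} {
		fmt.Printf("%-20s -> %q\n", sel, attrName(sel))
	}
}
```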

func GetBaseURL

func GetBaseURL(fullURL string) string

func GetCurrentURL (added in v0.1.18)

func GetCurrentURL(fullURL string) string

GetCurrentURL extracts just the path from a full URL, removing the query parameters and fragments
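
Stripping the query string and fragment can be done with net/url, as sketched below. `stripQueryAndFragment` is a hypothetical helper, and the description above leaves open whether GetCurrentURL keeps the scheme and host or returns the path alone; this sketch keeps them and prints the bare path separately.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripQueryAndFragment returns fullURL without its query string and fragment.
func stripQueryAndFragment(fullURL string) (string, error) {
	u, err := url.Parse(fullURL)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

func main() {
	const raw = "https://example.com/products/list?page=2&sort=asc#reviews"
	clean, err := stripQueryAndFragment(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(clean)

	// The bare path is available from the parsed URL as well.
	if u, err := url.Parse(raw); err == nil {
		fmt.Println(u.Path)
	}
}
```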

func GetFloat

func GetFloat(htmlText, selector string) (float64, error)

GetFloat extracts text from the first element matching the selector and converts it to float64. It returns 0.0 if no match is found or the conversion fails.

func GetFullURL

func GetFullURL(baseURL, relativePath string) string

func GetInt

func GetInt(htmlText, selector string) (int, error)

GetInt extracts text from the first element matching the selector and converts it to int. It returns 0 if no match is found or the conversion fails.

func GetOuterHTML

func GetOuterHTML(htmlText, selector string) ([]string, error)

GetOuterHTML extracts the outer HTML of elements matching the given CSS selector from HTML text. It returns a slice of outer HTML strings for all matching elements.

func GetText

func GetText(htmlText, selector string) ([]string, error)

GetText extracts the text content of elements matching the given CSS selector from HTML text. It returns a slice of text strings for all matching elements.

func GetTextSingle

func GetTextSingle(htmlText, selector string) (string, error)

GetTextSingle extracts the text content of the first element matching the given CSS selector. It returns an empty string if no match is found.

func GetTime

func GetTime(htmlText, selector, format string) (*time.Time, error)

GetTime extracts text from the first element matching the selector and parses it with the given format, returning a *time.Time.

Types

type ExtractionFunc

type ExtractionFunc func(i int, s *goquery.Selection)

type Options

type Options struct {
	// UserAgent to use for requests
	UserAgent string
	// AllowedDomains restricts scraping to specific domains
	AllowedDomains []string
	// MaxDepth limits how deep the scraper will follow links
	MaxDepth int
	// Async enables asynchronous scraping
	Async bool
	// MaxParallelRequests sets the maximum number of parallel requests
	MaxParallelRequests int
	// MaxRetries specifies the maximum number of retries for requests
	MaxRetries int
	// UseCloudflareBypass enables Cloudflare bypass using proper TLS and headers
	// Helps avoid triggering Cloudflare challenges in the first place
	UseCloudflareBypass bool
	// Logger allows custom logging in debug (optional)
	Logger *zap.Logger
}

Options provides configuration for the Scraper

type PaginationConfig

type PaginationConfig struct {
	// NextPageSelector is the CSS selector for the "next page" link
	// if the selector matches no elements, pagination stops
	NextPageSelector string
	// LastPageSelector is the CSS selector that indicates the last page number
	// pagination is done with incrementing page numbers until this selector value
	// using NextPageURLPattern to construct URLs
	LastPageSelector string
	// NextPageURLPattern is an optional pattern to construct the next page URL by
	// replacing a '::page::' with the page number.
	// This is mandatory if LastPageSelector is used
	NextPageURLPattern string
}

PaginationConfig holds configuration for paginated scraping
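
The field comments above imply a simple rule for choosing a pagination mode, sketched below with a local copy of the struct. The `paginationMode` helper and its mode names are hypothetical, and giving LastPageSelector precedence when both selectors are set is an assumption; only the requirement that LastPageSelector needs a NextPageURLPattern comes from the comments.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PaginationConfig mirrors the fields of the library's struct.
type PaginationConfig struct {
	NextPageSelector   string
	LastPageSelector   string
	NextPageURLPattern string
}

// paginationMode reports which pagination style a config describes.
func paginationMode(c PaginationConfig) (string, error) {
	switch {
	case c.LastPageSelector != "":
		// The pattern is mandatory when LastPageSelector is used.
		if !strings.Contains(c.NextPageURLPattern, "::page::") {
			return "", errors.New("LastPageSelector requires a NextPageURLPattern containing ::page::")
		}
		return "parallel", nil
	case c.NextPageSelector != "":
		return "sequential", nil
	default:
		return "single", nil
	}
}

func main() {
	mode, err := paginationMode(PaginationConfig{
		LastPageSelector:   "span.total-pages",
		NextPageURLPattern: "/products?page=::page::",
	})
	fmt.Println(mode, err)
}
```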

type Result

type Result struct {
	Data string
	Err  error
}

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

Scraper represents an HTML scraper with configurable options

func New

func New(opts Options) *Scraper

New creates a new Scraper instance with the given options

func NewDefault

func NewDefault() *Scraper

NewDefault creates a new Scraper instance with default options

func (*Scraper) ScrapeHTML

func (s *Scraper) ScrapeHTML(url string) (string, error)

ScrapeHTML fetches and returns the complete HTML content for a given URL. It implements exponential backoff retry for 429 (Too Many Requests) status codes, detects bot challenges, and uses rod to solve CAPTCHAs and obtain cookies.

func (*Scraper) ScrapeOuterHTML

func (s *Scraper) ScrapeOuterHTML(url, selector string) ([]string, error)

ScrapeOuterHTML fetches the outer HTML of elements matching the given CSS selector

func (*Scraper) ScrapePaginated

func (s *Scraper) ScrapePaginated(url, selector string, config PaginationConfig) (<-chan Result, error)

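ScrapePaginated scrapes the outer HTML of elements matching the selector across multiple pages. It returns a read-only channel that streams results as they are scraped, plus an error for failures that prevent scraping from starting. Errors encountered while scraping are delivered through the Err field of each Result.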
