readability

package module
v0.0.0-...-96dfdda
Warning

This package is not in the latest version of its module.

Published: Mar 11, 2026 License: MIT Imports: 21 Imported by: 0

README

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This is a fork of github.com/go-shiori/go-readability originally written by Radhi Fadlillah and maintained by Felipe Martin and GitHub contributors. For more information about the changes in this fork, see FORK.md.

Radhi Fadlillah initially ported Readability.js line by line to Go so that it looks and works as similarly as possible. The hope is that every web page that Readability.js can parse is also parseable by go-readability.

This module is compatible with Readability.js v0.6.0.

Installation

Note: you are viewing documentation for version 0, which is API-compatible with github.com/go-shiori/go-readability. Development of this project continues in the v2 branch, which you should choose for the best speed and memory efficiency. Its API-breaking changes are that some Article fields were converted to methods.

To add this package to your project, use go get:

go get -u codeberg.org/readeck/go-readability

And to get the v2 branch instead:

go get -u codeberg.org/readeck/go-readability/v2

Example

package main

import (
	"fmt"
	"log"
	"net/url"
	"os"

	readability "codeberg.org/readeck/go-readability"
)

func main() {
	srcFile, err := os.Open("index.html")
	if err != nil {
		log.Fatal(err)
	}
	defer srcFile.Close()

	baseURL, err := url.Parse("https://example.com/path/to/article")
	if err != nil {
		log.Fatal(err)
	}
	article, err := readability.FromReader(srcFile, baseURL)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Found article with title %q\n\n", article.Title)
	// Print the parsed, cleaned-up HTML markup of the article.
	fmt.Println(article.Content)
}

Command Line Usage

You can also use go-readability as a command-line tool:

go install codeberg.org/readeck/go-readability/cmd/go-readability@latest

Now you can use it by running go-readability in your terminal:

$ go-readability -h

go-readability is a parser that extracts article contents from a web page.
The source can be a URL or a filesystem path to a HTML file.
Pass "-" or no argument to read the HTML document from standard input.
Use "--http :0" to automatically choose an available port for the HTTP server.

Usage:
  go-readability [<flags>...] [<url> | <file> | -]

Flags:
  -h, --help          help for go-readability
  -l, --http string   start the http server at the specified address
  -m, --metadata      only print the page's metadata
  -t, --text          only print the page's text

Documentation

Overview

Package readability is a Go package that finds the main readable content of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla and was written line by line so that it looks and works as similarly as possible. The hope is that every web page that Readability.js can parse is also parseable by go-readability.


Variables

var (
	RxUnlikelyCandidates   = regexp.MustCompile(`(?i)-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote`)
	RxOkMaybeItsACandidate = regexp.MustCompile(`(?i)and|article|body|column|content|main|mathjax|shadow`)
	RxPositive             = regexp.MustCompile(`(?i)article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)
	RxNegative             = regexp.MustCompile(`(?i)-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|footer|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|widget`)
	RxByline               = regexp.MustCompile(`(?i)byline|author|dateline|writtenby|p-author`)
	RxNormalize            = regexp.MustCompile(`(?i)\s{2,}`)
	RxVideos               = regexp.MustCompile(`(?i)//(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq|bilibili|live\.bilibili)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)`)
	RxTokenize             = regexp.MustCompile(`(?i)\W+`)
	RxWhitespace           = regexp.MustCompile(`(?i)^\s*$`)
	RxHasContent           = regexp.MustCompile(`(?i)\S$`)
	RxHashURL              = regexp.MustCompile(`(?i)^#.+`)
	RxPropertyPattern      = regexp.MustCompile(`(?i)\s*(dc|dcterm|og|article|twitter)\s*:\s*(author|creator|description|title|site_name|published_time|modified_time|image\S*)\s*`)
	RxNamePattern          = regexp.MustCompile(`(?i)^\s*(?:(dc|dcterm|article|og|twitter|parsely|weibo:(article|webpage))\s*[-\.:]\s*)?(author|creator|pub-date|description|title|site_name|published_time|modified_time|image)\s*$`)
	RxTitleSeparator       = regexp.MustCompile(`(?i) [\|\-–—\\/>»] `)
	RxTitleHierarchySep    = regexp.MustCompile(`(?i) [\\/>»] `)
	RxTitleRemoveFinalPart = regexp.MustCompile(`(?i)(.*)[\|\-–—\\/>»] .*`)
	RxTitleRemove1stPart   = regexp.MustCompile(`(?i)[^\|\-–—\\/>»]*[\|\-–—\\/>»](.*)`)
	RxTitleAnySeparator    = regexp.MustCompile(`(?i)[\|\-–—\\/>»]+`)
	RxDisplayNone          = regexp.MustCompile(`(?i)display\s*:\s*none`)
	RxVisibilityHidden     = regexp.MustCompile(`(?i)visibility\s*:\s*hidden`)
	RxSentencePeriod       = regexp.MustCompile(`(?i)\.( |$)`)
	RxShareElements        = regexp.MustCompile(`(?i)(\b|_)(share|sharedaddy)(\b|_)`)
	RxFaviconSize          = regexp.MustCompile(`(?i)(\d+)x(\d+)`)
	RxLazyImageSrcset      = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)\s+\d`)
	RxLazyImageSrc         = regexp.MustCompile(`(?i)^\s*\S+\.(jpg|jpeg|png|webp)\S*\s*$`)
	RxImgExtensions        = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)`)
	RxSrcsetURL            = regexp.MustCompile(`(?i)(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))`)
	RxB64DataURL           = regexp.MustCompile(`(?i)^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)
	RxJsonLdArticleTypes   = regexp.MustCompile(`(?i)^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$`)
	RxCDATA                = regexp.MustCompile(`^\s*<!\[CDATA\[|\]\]>\s*$`)
	RxSchemaOrg            = regexp.MustCompile(`(?i)^https?\:\/\/schema\.org\/?$`)
	// used to see if a node's content matches words commonly used for ad blocks or loading indicators
	RxAdWords      = regexp.MustCompile(`(?i)^(ad(vertising|vertisement)?|pub(licité)?|werb(ung)?|广告|Реклама|Anuncio)$`)
	RxLoadingWords = regexp.MustCompile(`(?i)^((loading|正在加载|Загрузка|chargement|cargando)(…|\.\.\.)?)$`)
)

All of the regular expressions in use within readability. They are defined here so that they are not compiled repeatedly inside loops.
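To illustrate how a few of these patterns behave, the following sketch recompiles three of them with the standard regexp package instead of importing the module. The helper names (dataURLMime, faviconSize, normalizeSpaces) are hypothetical and not part of go-readability, and the way the library applies each pattern internally is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Patterns copied verbatim from the var block above.
var (
	rxB64DataURL  = regexp.MustCompile(`(?i)^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)
	rxFaviconSize = regexp.MustCompile(`(?i)(\d+)x(\d+)`)
	rxNormalize   = regexp.MustCompile(`(?i)\s{2,}`)
)

// dataURLMime (hypothetical helper) returns the MIME type of a base64
// data URL, or "" when the input is not a base64 data URL.
func dataURLMime(src string) string {
	m := rxB64DataURL.FindStringSubmatch(src)
	if m == nil {
		return ""
	}
	return m[1]
}

// faviconSize (hypothetical helper) extracts width and height from a
// "WxH" string such as the sizes attribute of a <link rel="icon">.
func faviconSize(sizes string) (w, h int, ok bool) {
	m := rxFaviconSize.FindStringSubmatch(sizes)
	if m == nil {
		return 0, 0, false
	}
	w, _ = strconv.Atoi(m[1])
	h, _ = strconv.Atoi(m[2])
	return w, h, true
}

// normalizeSpaces (hypothetical helper) collapses every run of two or
// more whitespace characters into one space and trims the ends.
func normalizeSpaces(s string) string {
	return strings.TrimSpace(rxNormalize.ReplaceAllString(s, " "))
}

func main() {
	fmt.Printf("%q\n", dataURLMime("data:image/png;base64,iVBORw0KGgo="))
	fmt.Println(faviconSize("32x32"))
	fmt.Printf("%q\n", normalizeSpaces("  Hello   readable \n\n world "))
}
```

Because the exported variables are ordinary *regexp.Regexp values, the same calls should work on readability.RxB64DataURL and the others once the module is imported.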

Functions

func Check

func Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing. It is a wrapper for `Parser.Check()` and is useful if you only use the default parser.

func CheckDocument

func CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing. It is a wrapper for `Parser.CheckDocument()` and is useful if you only use the default parser.

Types

type Article

type Article struct {
	Title         string
	Byline        string
	Node          *html.Node
	Content       string
	TextContent   string
	Length        int
	Excerpt       string
	SiteName      string
	Image         string
	Favicon       string
	Language      string
	PublishedTime *time.Time
	ModifiedTime  *time.Time
}

Article is the final readable content.

func FromDocument

func FromDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

FromDocument parses a document and returns the readable content. It is a wrapper for `Parser.ParseDocument()` and is useful if you only want to use the default parser.

func FromReader

func FromReader(input io.Reader, pageURL *nurl.URL) (Article, error)

FromReader parses an `io.Reader` and returns the readable content. It is a wrapper for `Parser.Parse()` and is useful if you only want to use the default parser.

Example
srcFile, err := os.Open("index.html")
if err != nil {
	log.Fatal(err)
}
defer srcFile.Close()

baseURL, err := url.Parse("https://example.com/path/to/article")
if err != nil {
	log.Fatal(err)
}
article, err := FromReader(srcFile, baseURL)
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Found article with title %q\n\n", article.Title)
// Print the parsed, cleaned-up HTML markup of the article.
fmt.Println(article.Content)

func FromURL

func FromURL(pageURL string, timeout time.Duration, requestModifiers ...RequestWith) (Article, error)

FromURL fetches the web page from the specified URL, then parses the response to find the readable content.

type Parser

type Parser struct {
	// MaxElemsToParse is the max number of nodes supported by this
	// parser. Default: 0 (no limit)
	MaxElemsToParse int
	// NTopCandidates is the number of top candidates to consider when
	// analysing how tight the competition is among candidates.
	NTopCandidates int
	// CharThresholds is the default number of chars an article must
	// have in order to return a result
	CharThresholds int
	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string
	// KeepClasses specifies whether the classes should be kept instead of stripped.
	KeepClasses bool
	// TagsToScore lists the element tags to score by default.
	TagsToScore []string
	// Deprecated: opt into printing logs to stderr. Use Logger instead.
	Debug bool
	// The structured logger to write to. The default log is written to io.Discard.
	Logger *slog.Logger
	// DisableJSONLD determines if metadata in JSON+LD will be extracted
	// or not. Default: false.
	DisableJSONLD bool
	// contains filtered or unexported fields
}

Parser is the parser that parses the page to get the readable content.

func NewParser

func NewParser() Parser

NewParser returns a new Parser that is set up with default values.

func (*Parser) Check

func (ps *Parser) Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing.

func (*Parser) CheckDocument

func (ps *Parser) CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing.

func (*Parser) Parse

func (ps *Parser) Parse(input io.Reader, pageURL *nurl.URL) (Article, error)

Parse parses a reader and finds the main readable content.

func (*Parser) ParseAndMutate

func (ps *Parser) ParseAndMutate(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseAndMutate is like ParseDocument, but mutates doc during parsing.

func (*Parser) ParseDocument

func (ps *Parser) ParseDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseDocument parses the specified document and finds the main readable content.

type RequestWith

type RequestWith func(r *http.Request)

Directories

cmd/go-readability (command)
