readability

v0.0.0-...-96dfdda | Published: Mar 11, 2026 | License: MIT | Imports: 21 | Imported by: 0

Repository

github.com/jobindex-open/go-readability

README ¶

Go-Readability

Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This is a fork of github.com/go-shiori/go-readability originally written by Radhi Fadlillah and maintained by Felipe Martin and GitHub contributors. For more information about the changes in this fork, see FORK.md.

Radhi Fadlillah initially ported Readability.js to Go line by line so that it looks and works as similarly as possible. The intent is that any web page that Readability.js can parse can also be parsed by go-readability.

This module is compatible with Readability.js v0.6.0.

Installation

Note: you are viewing documentation for version 0, which is API-compatible with github.com/go-shiori/go-readability. Development of this project continues in the v2 branch, which you should choose for best speed and memory efficiency. Its API-breaking change is that some Article fields were converted to methods.

To add this package to your project, use go get:

go get -u codeberg.org/readeck/go-readability

And to get the v2 branch instead:

go get -u codeberg.org/readeck/go-readability/v2

Example

package main

import (
	"fmt"
	"log"
	"net/url"
	"os"

	readability "codeberg.org/readeck/go-readability"
)

func main() {
	// Open the HTML document to extract the article from.
	srcFile, err := os.Open("index.html")
	if err != nil {
		log.Fatal(err)
	}
	defer srcFile.Close()

	// The page URL is passed to the parser alongside the document.
	baseURL, err := url.Parse("https://example.com/path/to/article")
	if err != nil {
		log.Fatal(err)
	}

	article, err := readability.FromReader(srcFile, baseURL)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Found article with title %q\n\n", article.Title)
	// Print the parsed, cleaned-up HTML markup of the article.
	fmt.Println(article.Content)
}

Command Line Usage

You can also use go-readability as a command-line tool:

go install codeberg.org/readeck/go-readability/cmd/go-readability@latest

Now you can use it by running go-readability in your terminal:

$ go-readability -h

go-readability is a parser that extracts article contents from a web page.
The source can be a URL or a filesystem path to a HTML file.
Pass "-" or no argument to read the HTML document from standard input.
Use "--http :0" to automatically choose an available port for the HTTP server.

Usage:
  go-readability [<flags>...] [<url> | <file> | -]

Flags:
  -h, --help          help for go-readability
  -l, --http string   start the http server at the specified address
  -m, --metadata      only print the page's metadata
  -t, --text          only print the page's text

Documentation ¶

Overview ¶

Package readability is a Go package that finds the main readable content of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on Readability.js by Mozilla and was written line by line so that it looks and works as similarly as possible. The intent is that any web page that Readability.js can parse can also be parsed by go-readability.

Examples ¶

FromReader

Constants ¶

This section is empty.

Variables ¶

var (
	RxUnlikelyCandidates   = regexp.MustCompile(`(?i)-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote`)
	RxOkMaybeItsACandidate = regexp.MustCompile(`(?i)and|article|body|column|content|main|mathjax|shadow`)
	RxPositive             = regexp.MustCompile(`(?i)article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story`)
	RxNegative             = regexp.MustCompile(`(?i)-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|footer|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|widget`)
	RxByline               = regexp.MustCompile(`(?i)byline|author|dateline|writtenby|p-author`)
	RxNormalize            = regexp.MustCompile(`(?i)\s{2,}`)
	RxVideos               = regexp.MustCompile(`(?i)//(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq|bilibili|live\.bilibili)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)`)
	RxTokenize             = regexp.MustCompile(`(?i)\W+`)
	RxWhitespace           = regexp.MustCompile(`(?i)^\s*$`)
	RxHasContent           = regexp.MustCompile(`(?i)\S$`)
	RxHashURL              = regexp.MustCompile(`(?i)^#.+`)
	RxPropertyPattern      = regexp.MustCompile(`(?i)\s*(dc|dcterm|og|article|twitter)\s*:\s*(author|creator|description|title|site_name|published_time|modified_time|image\S*)\s*`)
	RxNamePattern          = regexp.MustCompile(`(?i)^\s*(?:(dc|dcterm|article|og|twitter|parsely|weibo:(article|webpage))\s*[-\.:]\s*)?(author|creator|pub-date|description|title|site_name|published_time|modified_time|image)\s*$`)
	RxTitleSeparator       = regexp.MustCompile(`(?i) [\|\-–—\\/>»] `)
	RxTitleHierarchySep    = regexp.MustCompile(`(?i) [\\/>»] `)
	RxTitleRemoveFinalPart = regexp.MustCompile(`(?i)(.*)[\|\-–—\\/>»] .*`)
	RxTitleRemove1stPart   = regexp.MustCompile(`(?i)[^\|\-–—\\/>»]*[\|\-–—\\/>»](.*)`)
	RxTitleAnySeparator    = regexp.MustCompile(`(?i)[\|\-–—\\/>»]+`)
	RxDisplayNone          = regexp.MustCompile(`(?i)display\s*:\s*none`)
	RxVisibilityHidden     = regexp.MustCompile(`(?i)visibility\s*:\s*hidden`)
	RxSentencePeriod       = regexp.MustCompile(`(?i)\.( |$)`)
	RxShareElements        = regexp.MustCompile(`(?i)(\b|_)(share|sharedaddy)(\b|_)`)
	RxFaviconSize          = regexp.MustCompile(`(?i)(\d+)x(\d+)`)
	RxLazyImageSrcset      = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)\s+\d`)
	RxLazyImageSrc         = regexp.MustCompile(`(?i)^\s*\S+\.(jpg|jpeg|png|webp)\S*\s*$`)
	RxImgExtensions        = regexp.MustCompile(`(?i)\.(jpg|jpeg|png|webp)`)
	RxSrcsetURL            = regexp.MustCompile(`(?i)(\S+)(\s+[\d.]+[xw])?(\s*(?:,|$))`)
	RxB64DataURL           = regexp.MustCompile(`(?i)^data:\s*([^\s;,]+)\s*;\s*base64\s*,`)
	RxJsonLdArticleTypes   = regexp.MustCompile(`(?i)^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$`)
	RxCDATA                = regexp.MustCompile(`^\s*<!\[CDATA\[|\]\]>\s*$`)
	RxSchemaOrg            = regexp.MustCompile(`(?i)^https?\:\/\/schema\.org\/?$`)
	// used to see if a node's content matches words commonly used for ad blocks or loading indicators
	RxAdWords      = regexp.MustCompile(`(?i)^(ad(vertising|vertisement)?|pub(licité)?|werb(ung)?|广告|Реклама|Anuncio)$`)
	RxLoadingWords = regexp.MustCompile(`(?i)^((loading|正在加载|Загрузка|chargement|cargando)(…|\.\.\.)?)$`)
)

All of the regular expressions in use within readability. They are defined at package level so that they are not compiled repeatedly inside loops.
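To make two of these patterns concrete, here is a minimal sketch that copies RxNormalize and RxTitleSeparator verbatim into local variables and exercises them with the standard regexp package. The helper names collapseSpaces and hasSeparator are hypothetical and are not part of this package; how the parser itself applies these patterns is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Local copies of two patterns from the variable list above, so that
// this sketch runs without importing the package.
var (
	rxNormalize      = regexp.MustCompile(`(?i)\s{2,}`)
	rxTitleSeparator = regexp.MustCompile(`(?i) [\|\-–—\\/>»] `)
)

// collapseSpaces is a hypothetical helper: it replaces every run of two
// or more whitespace characters with a single space and trims the ends.
func collapseSpaces(s string) string {
	return strings.TrimSpace(rxNormalize.ReplaceAllString(s, " "))
}

// hasSeparator is a hypothetical helper: it reports whether a title
// contains a separator character with a space on both sides.
func hasSeparator(title string) bool {
	return rxTitleSeparator.MatchString(title)
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("Hello   \n\t world"))
	fmt.Println(hasSeparator("Some Article | Example Site"))
	// A hyphen inside a word is not surrounded by spaces, so it does not match.
	fmt.Println(hasSeparator("Well-known facts"))
}
```

Because the separator pattern requires a space on each side, hyphenated words in a title are not mistaken for a title/site boundary.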

Functions ¶

func Check ¶

func Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing. It is a wrapper for `Parser.Check()`, useful if you only use the default parser.

func CheckDocument ¶

func CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing. It is a wrapper for `Parser.CheckDocument()`, useful if you only use the default parser.

Types ¶

type Article ¶

type Article struct {
	Title         string
	Byline        string
	Node          *html.Node
	Content       string
	TextContent   string
	Length        int
	Excerpt       string
	SiteName      string
	Image         string
	Favicon       string
	Language      string
	PublishedTime *time.Time
	ModifiedTime  *time.Time
}

Article is the final readable content.

func FromDocument ¶

func FromDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

FromDocument parses a document and returns the readable content. It is a wrapper for `Parser.ParseDocument()`, useful if you only want to use the default parser.

func FromReader ¶

func FromReader(input io.Reader, pageURL *nurl.URL) (Article, error)

FromReader parses an `io.Reader` and returns the readable content. It is a wrapper for `Parser.Parse()`, useful if you only want to use the default parser.

Example ¶

srcFile, err := os.Open("index.html")
if err != nil {
	log.Fatal(err)
}
defer srcFile.Close()

baseURL, _ := url.Parse("https://example.com/path/to/article")
article, err := FromReader(srcFile, baseURL)
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Found article with title %q\n\n", article.Title)
// Print the parsed, cleaned-up HTML markup of the article.
fmt.Println(article.Content)

func FromURL ¶

func FromURL(pageURL string, timeout time.Duration, requestModifiers ...RequestWith) (Article, error)

FromURL fetches the web page from the specified URL, then parses the response to find the readable content.

type Parser ¶

type Parser struct {
	// MaxElemsToParse is the max number of nodes supported by this
	// parser. Default: 0 (no limit)
	MaxElemsToParse int
	// NTopCandidates is the number of top candidates to consider when
	// analysing how tight the competition is among candidates.
	NTopCandidates int
	// CharThresholds is the default number of chars an article must
	// have in order to return a result
	CharThresholds int
	// ClassesToPreserve are the classes that readability sets itself.
	ClassesToPreserve []string
	// KeepClasses specify whether the classes should be stripped or not.
	KeepClasses bool
	// TagsToScore is element tags to score by default.
	TagsToScore []string
	// Deprecated: opt into printing logs to stderr. Use Logger instead.
	Debug bool
	// The structured logger to write to. The default log is written to io.Discard.
	Logger *slog.Logger
	// DisableJSONLD determines if metadata in JSON+LD will be extracted
	// or not. Default: false.
	DisableJSONLD bool
	// contains filtered or unexported fields
}

Parser is the parser that parses the page to get the readable content.

func NewParser ¶

func NewParser() Parser

NewParser returns a new Parser set up with default values.

func (*Parser) Check ¶

func (ps *Parser) Check(input io.Reader) bool

Check checks whether the input is readable without parsing the whole thing.

func (*Parser) CheckDocument ¶

func (ps *Parser) CheckDocument(doc *html.Node) bool

CheckDocument checks whether the document is readable without parsing the whole thing.

func (*Parser) Parse ¶

func (ps *Parser) Parse(input io.Reader, pageURL *nurl.URL) (Article, error)

Parse parses a reader and finds the main readable content.

func (*Parser) ParseAndMutate ¶

func (ps *Parser) ParseAndMutate(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseAndMutate is like ParseDocument, but mutates doc during parsing.

func (*Parser) ParseDocument ¶

func (ps *Parser) ParseDocument(doc *html.Node, pageURL *nurl.URL) (Article, error)

ParseDocument parses the specified document and finds the main readable content.

type RequestWith ¶

type RequestWith func(r *http.Request)

Directories ¶

Path	Synopsis
cmd/go-readability	command
render
scripts