Documentation
Constants
This section is empty.
Variables
This section is empty.
Functions
func CreateReadableDocument
func CreateReadableDocument(extract *ExtractResult) *html.Node
CreateReadableDocument is a helper function that converts the extract result into a single HTML document, complete with its metadata and comments (if they exist).
Types
type Config
type Config struct {
// Deduplication config
CacheSize int
MaxDuplicateCount int
MinDuplicateCheckSize int
// Extraction size setting
MinExtractedSize int
MinExtractedCommentSize int
MinOutputSize int
MinOutputCommentSize int
}
Config holds advanced settings for fine-tuning the extraction result. Use it to specify the minimum size of the extracted content and how much duplicate text is allowed. In most cases the default config should be good enough.
func DefaultConfig
func DefaultConfig() *Config
DefaultConfig returns the default configuration value.
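As a rough illustration of tuning these settings, the sketch below copies a config and overrides only the minimum sizes. The `Config` type here is a local stand-in copied from the declaration above so the example compiles on its own, `withMinSizes` is a hypothetical helper that is not part of the package, and the numbers are placeholders rather than the library's real defaults.

```go
package main

import "fmt"

// Config is a local stand-in that mirrors the Config struct documented above.
// It exists only so this sketch compiles without importing the library.
type Config struct {
	// Deduplication config
	CacheSize             int
	MaxDuplicateCount     int
	MinDuplicateCheckSize int
	// Extraction size setting
	MinExtractedSize        int
	MinExtractedCommentSize int
	MinOutputSize           int
	MinOutputCommentSize    int
}

// withMinSizes is a hypothetical helper (not part of the package). It returns
// a copy of base with the minimum extracted and output sizes overridden, and
// leaves base itself unchanged.
func withMinSizes(base *Config, minExtracted, minOutput int) *Config {
	cfg := *base // copy the struct so the caller's config is not mutated
	cfg.MinExtractedSize = minExtracted
	cfg.MinOutputSize = minOutput
	return &cfg
}

func main() {
	// Placeholder values, NOT the library's real defaults.
	base := &Config{CacheSize: 1000, MinExtractedSize: 100, MinOutputSize: 10}
	tuned := withMinSizes(base, 250, 50)
	fmt.Println("base:", base.MinExtractedSize, "tuned:", tuned.MinExtractedSize)
}
```

With the real package, the starting point would presumably be the value returned by `DefaultConfig()`, and the tuned copy would be assigned to the `Config` field of `Options`.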
type ExtractResult
type ExtractResult struct {
// ContentNode is the extracted content as a `html.Node`.
ContentNode *html.Node
// CommentsNode is the extracted comments as a `html.Node`.
// Will be nil if `ExcludeComments` in `Options` is set to true.
CommentsNode *html.Node
// ContentText is the extracted content as plain text.
ContentText string
// CommentsText is the extracted comments as plain text.
// Will be empty if `ExcludeComments` in `Options` is set to true.
CommentsText string
// Metadata is the extracted metadata, which is taken from several sources, i.e.
// <meta> tags, JSON-LD and the OpenGraph scheme.
Metadata Metadata
}
ExtractResult is the result of content extraction.
func Extract
func Extract(r io.Reader, opts Options) (*ExtractResult, error)
Extract parses a reader and finds the main readable content.
func ExtractDocument
func ExtractDocument(doc *html.Node, opts Options) (*ExtractResult, error)
ExtractDocument parses the specified document and finds the main readable content.
type FallbackConfig
type FallbackConfig struct {
//readability
HasReadability bool
ReadabilityFallback *html.Node
HasDistiller bool
DistillerFallback *html.Node
//other fallbacks are possible as well: if set the above four settings are ignored
OtherFallbacks []*html.Node
}
FallbackConfig specifies a list of fallback candidates, in particular readability and domdistiller.
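The precedence described in the struct comments can be sketched as follows. This is only one reading of those comments, not the package's actual logic: `Node` is a placeholder for `html.Node`, `FallbackConfig` is a local copy of the struct above, `selectFallbacks` is a hypothetical helper, and treating a `Has...` flag as "a precalculated node is supplied" is an assumption.

```go
package main

import "fmt"

// Node is a placeholder for html.Node so the sketch needs no external module.
type Node struct{ Data string }

// FallbackConfig is a local copy of the struct documented above.
type FallbackConfig struct {
	HasReadability      bool
	ReadabilityFallback *Node
	HasDistiller        bool
	DistillerFallback   *Node
	OtherFallbacks      []*Node
}

// selectFallbacks is a hypothetical helper showing one reading of the
// documented precedence:
//   - nil config: no fallback at all
//   - non-nil OtherFallbacks: use that list and ignore the other four fields
//   - otherwise: use whichever readability/distiller nodes were precalculated
func selectFallbacks(fc *FallbackConfig) []*Node {
	if fc == nil {
		return nil
	}
	if fc.OtherFallbacks != nil {
		return fc.OtherFallbacks
	}
	var out []*Node
	if fc.HasReadability && fc.ReadabilityFallback != nil {
		out = append(out, fc.ReadabilityFallback)
	}
	if fc.HasDistiller && fc.DistillerFallback != nil {
		out = append(out, fc.DistillerFallback)
	}
	return out
}

func main() {
	fc := &FallbackConfig{
		HasReadability:      true,
		ReadabilityFallback: &Node{Data: "readability result"},
	}
	for _, n := range selectFallbacks(fc) {
		fmt.Println("candidate:", n.Data)
	}
}
```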
type Metadata
type Metadata struct {
Title string
Author string
URL string
Hostname string
Description string
Sitename string
Date time.Time
Categories []string
Tags []string
ID string
Fingerprint string
License string
Language string
Image string
PageType string
}
Metadata is the metadata of the page.
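The `HasEssentialMetadata` option in `Options` names date, title and URL as the essential metadata. A caller who wants a similar check on an extraction result might write something like the sketch below. The `Metadata` type is a trimmed local stand-in with only the three relevant fields, and `hasEssentialMetadata` and `sampleMetadata` are hypothetical helpers; the package's own check may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Metadata is a trimmed local stand-in for the struct documented above,
// keeping only the fields this sketch needs.
type Metadata struct {
	Title string
	URL   string
	Date  time.Time
}

// hasEssentialMetadata is a hypothetical helper. It reports whether the
// title, URL and date are all present; a zero time.Time counts as missing.
func hasEssentialMetadata(m Metadata) bool {
	return m.Title != "" && m.URL != "" && !m.Date.IsZero()
}

// sampleMetadata returns made-up values used purely for illustration.
func sampleMetadata() Metadata {
	return Metadata{
		Title: "Example post",
		URL:   "https://example.com/post",
		Date:  time.Date(2020, time.January, 1, 0, 0, 0, 0, time.UTC),
	}
}

func main() {
	m := sampleMetadata()
	fmt.Println("complete:", hasEssentialMetadata(m))

	m.Date = time.Time{} // simulate a page with no detectable date
	fmt.Println("without date:", hasEssentialMetadata(m))
}
```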
type Options
type Options struct {
// Config is the advanced configuration to fine tune the
// extraction result. Keep it as nil to use default config.
Config *Config
// OriginalURL is the original URL of the page. Might be overwritten by URL in metadata.
OriginalURL *nurl.URL
// TargetLanguage is the ISO 639-1 language code that makes the extractor only process web
// pages that use the specified language.
TargetLanguage string
// If FallbackCandidates is nil, no fallback will be performed.
// Otherwise, readability and domdistiller fallbacks will be used if precalculated.
// A non-nil OtherFallbacks ensures that this list is used (rather than readability/distiller).
FallbackCandidates *FallbackConfig
// FavorPrecision specifies whether to prefer less text but correct extraction.
FavorPrecision bool
// FavorRecall specifies whether to prefer more text even when unsure.
FavorRecall bool
// ExcludeComments specifies whether to exclude comments from the extraction result.
ExcludeComments bool
// ExcludeTables specifies whether to exclude information within the HTML <table> element.
ExcludeTables bool
// IncludeImages specifies whether the extraction result will include images (experimental).
IncludeImages bool
// IncludeLinks specifies whether the extraction result will include links along with their
// targets (experimental).
IncludeLinks bool
// BlacklistedAuthors is a list of author names to be excluded from the extraction result.
BlacklistedAuthors []string
// Deduplicate specifies whether to remove duplicate segments and sections.
Deduplicate bool
// HasEssentialMetadata makes the extractor only keep documents featuring all essential
// metadata (date, title, url).
HasEssentialMetadata bool
// MaxTreeSize specifies the max number of elements inside a document.
// Documents that surpass this value will be discarded.
MaxTreeSize int
// EnableLog specifies whether logging should be enabled or not.
EnableLog bool
// HtmlDateOverride uses a pre-extracted date from the `htmldate` package.
HtmlDateOverride *htmldate.Result
// HtmlDateOptions is the configuration for the external `htmldate` package that is used to
// look for the publish date of a web page.
HtmlDateOptions *htmldate.Options
}
Options is the configuration for the extractor.
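A minimal sketch of filling in a few of these fields is shown below. To keep it compilable without the library, `Options` here is a trimmed local stand-in containing only four of the documented fields, and `newOptions` is a hypothetical helper; the `nurl` alias for `net/url` follows the field type shown above.

```go
package main

import (
	"fmt"
	nurl "net/url"
)

// Options is a trimmed local stand-in for the struct documented above,
// keeping only the fields this sketch sets.
type Options struct {
	OriginalURL     *nurl.URL
	TargetLanguage  string
	ExcludeComments bool
	Deduplicate     bool
}

// newOptions is a hypothetical helper (not part of the package). It parses
// the page URL and returns options that skip comments and remove duplicates.
func newOptions(rawURL, lang string) (Options, error) {
	u, err := nurl.Parse(rawURL)
	if err != nil {
		return Options{}, fmt.Errorf("parse original URL: %w", err)
	}
	return Options{
		OriginalURL:     u,
		TargetLanguage:  lang, // ISO 639-1 code, e.g. "en"
		ExcludeComments: true,
		Deduplicate:     true,
	}, nil
}

func main() {
	opts, err := newOptions("https://example.com/post", "en")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(opts.OriginalURL.Hostname(), opts.TargetLanguage)
}
```

With the real package, the resulting value would be passed as the `opts` argument of `Extract` or `ExtractDocument`, leaving `Config` as nil to use the default config.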
type SchemaData
Directories
| Path | Synopsis |
|---|---|
| cmd | |
| cmd/go-trafilatura (command) | |
| examples | |
| examples/chained (command) | |
| examples/from-file (command) | |
| examples/from-url (command) | |
| internal | |
| scripts | |
| scripts/comparison (command) | |