textproc

package module
v0.4.1 Latest
Published: Apr 11, 2026 License: BSD-2-Clause Imports: 16 Imported by: 2

README

Text processing

Extract information from text and HTML.

Useful for processing data from web scrapers.

Functions

HTML content extraction: extract visible text, hyperlinks, or image URLs from a parsed HTML tree. Key function: HTMLGetText.

HTML parsing and querying: parse HTML from string, []byte, or io.Reader; query nodes with XPath. Key functions: HTMLParseToNode, HTMLXPath.

Text cleaning: normalize Unicode representations, remove redundant whitespace, strip Vietnamese diacritics.

Text analysis: split text into words, build n-gram frequency maps, hash text to a 64-bit integer.

Example usage

A typical web scraping workflow:

  • Download the page
  • Parse the downloaded page with HTMLParseToNode
  • Extract visible text with HTMLGetText
  • Collect outgoing links with HTMLGetHREFs
  • Retrieve image URLs by passing each img node to HTMLGetImgSrc

See html_test.go TestHtmlUtils for a full worked example.

For text processing (n-grams, Vietnamese diacritic removal, word splitting), see text_test.go.

Documentation

Constants

This section is empty.

Variables

var (
	AlphaNumeric = toMapRunes(numerics + lowerAlphas + upperAlphas)

	AlphaNumericList   = []rune(numerics + lowerAlphas + upperAlphas)
	AlphaNumericEnList = []rune(
		"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
	AlphaEnList = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
)

Runes grouped by type, used for character type checking (Vietnamese alphabet).

Functions

func CheckValidXPath

func CheckValidXPath(xPath string) error

CheckValidXPath returns nil if the input XPath is valid.

func GenRandomVarName

func GenRandomVarName(wordLen int) string

GenRandomVarName returns an alphanumeric string whose first character is a letter.
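The "first character is a letter" rule can be sketched with the standard library alone. The helper randomVarName below is hypothetical and is not the package's code; in particular, returning an empty string for a non-positive length is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	letters  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	alphaNum = "0123456789" + letters
)

// randomVarName is a hypothetical stand-in, not the package's code.
// It returns a string of length n whose first character is a letter
// and whose remaining characters are letters or digits.
func randomVarName(n int) string {
	if n <= 0 {
		return "" // assumption: the package's behavior here is not documented
	}
	out := make([]byte, n)
	out[0] = letters[rand.Intn(len(letters))]
	for i := 1; i < n; i++ {
		out[i] = alphaNum[rand.Intn(len(alphaNum))]
	}
	return string(out)
}

func main() {
	fmt.Println(randomVarName(8))
}
```

Drawing the first character from a letters-only alphabet keeps the result usable as an identifier in most languages.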

func GenRandomWord

func GenRandomWord(minLen int, maxLen int, charList []rune) string

func HTMLGetHREFs

func HTMLGetHREFs(baseUrlStr string, node *html.Node) []string

HTMLGetHREFs returns all URLs in the HTML as absolute URLs. URLs that differ only in their fragment are treated as one URL. An invalid baseUrlStr is silently ignored.
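To illustrate what "absolute URL" and "same URL regardless of fragment" mean, here is a minimal sketch using only net/url. The helper resolveHref is hypothetical and may differ from how the package resolves links internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveHref is a hypothetical helper, not part of the package.
// It resolves href against base and drops the fragment, so links that
// differ only in their fragment map to the same absolute URL.
func resolveHref(base, href string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	abs := baseURL.ResolveReference(ref)
	abs.Fragment = ""
	abs.RawFragment = ""
	return abs.String(), nil
}

func main() {
	base := "https://example.com/docs/index.html"
	for _, href := range []string{"page.html#intro", "page.html#usage", "/about"} {
		abs, err := resolveHref(base, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

The first two links both resolve to https://example.com/docs/page.html, so a caller that collects results in a set would store them once.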

func HTMLGetImgSrc

func HTMLGetImgSrc(baseUrlStr string, imgNode *html.Node) string

HTMLGetImgSrc returns the absolute URL of the image.

func HTMLGetText

func HTMLGetText(node *html.Node) string

HTMLGetText returns all text in the HTML. The result does not contain JavaScript code or text generated by JavaScript. This function is slow because it traverses the tree recursively and makes multiple passes over the text.

func HTMLParseToNode

func HTMLParseToNode(htmlContent any) *html.Node

HTMLParseToNode parses HTML content (string, []byte, or io.Reader) into an html.Node. Parse errors are silently ignored and an empty node is returned instead, so it should only be used for convenience in tests.

func HTMLRender

func HTMLRender(node *html.Node) string

HTMLRender is a convenience function that renders an HTML node to a string.

func HTMLRenderIndent

func HTMLRenderIndent(node *html.Node, prefix string, indent string) string

HTMLRenderIndent renders an HTML node to an indented string. Each element begins on a new line with "prefix" plus copies of "indent" for nesting depth. Output is XML-serialized (e.g. <br> becomes <br></br>): visual content is preserved but the node structure may differ if reparsed as HTML.

func HTMLXPath

func HTMLXPath(htmlTree *html.Node, xPath string) ([]*html.Node, error)

HTMLXPath finds all HTML nodes matching the XPath query.

func HashTextToInt

func HashTextToInt(word string) int64

HashTextToInt is a fast, low-collision hash function that maps a string to an int64.
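The hash algorithm is not stated in this documentation. As an illustration of hashing text to a 64-bit integer, the sketch below uses FNV-1a from the standard library; hashText is a hypothetical name, and FNV-1a is a swapped-in technique, not necessarily what HashTextToInt uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashText is a hypothetical stand-in that hashes a string to int64
// with 64-bit FNV-1a. The package's actual algorithm may differ.
func hashText(s string) int64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	// Converting uint64 to int64 keeps all 64 bits; the result may be negative.
	return int64(h.Sum64())
}

func main() {
	fmt.Println(hashText("hello") == hashText("hello")) // true: the hash is deterministic
	fmt.Println(hashText("hello"), hashText("world"))
}
```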

func NormalizeText

func NormalizeText(text string) string

NormalizeText normalizes different Unicode representations of the same string. For example, "é" can be a single rune ("\u00e9") or "e" followed by a combining acute accent ("e\u0301"); both should be treated as equal in text processing. Vietnamese text has an additional problem, the position of the diacritic: old style is òa, óa, ỏa, õa, ọa; new style is oà, oá, oả, oã, oạ.
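The following standard-library snippet shows why normalization is needed: the two spellings of "é" render identically but compare as different strings. It only demonstrates the problem; it does not reproduce what NormalizeText does.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

var (
	precomposed = "\u00e9"  // single rune: é
	decomposed  = "e\u0301" // 'e' followed by U+0301 COMBINING ACUTE ACCENT
)

func main() {
	fmt.Println(precomposed == decomposed)           // false
	fmt.Println(utf8.RuneCountInString(precomposed)) // 1
	fmt.Println(utf8.RuneCountInString(decomposed))  // 2
}
```

Without normalization, map lookups, deduplication, and n-gram counts would treat the two forms as distinct keys.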

func RemoveRedundantSpace

func RemoveRedundantSpace(text string) string

RemoveRedundantSpace replaces consecutive whitespace with a single space.
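One common way to collapse whitespace with the standard library is sketched below. The helper collapseSpaces is hypothetical; note that it also trims leading and trailing whitespace, which RemoveRedundantSpace may or may not do.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseSpaces is a hypothetical stand-in, not the package's code.
// strings.Fields splits on runs of Unicode whitespace, so joining the
// fields with one space collapses every run into a single space.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("a \t b\n\nc  ")) // "a b c"
}
```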

func RemoveVietnamDiacritic

func RemoveVietnamDiacritic(text string) string

RemoveVietnamDiacritic removes Vietnamese diacritics from text. Example: Đào => Dao
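A table-driven sketch of the idea follows. The table baseLetters is deliberately tiny and the helper stripDiacritics is hypothetical; it only handles precomposed runes, so decomposed input would have to be normalized first. The package's own approach may be different.

```go
package main

import (
	"fmt"
	"strings"
)

// baseLetters is an illustrative subset; a full table would cover
// every accented Vietnamese letter in both cases.
var baseLetters = map[rune]rune{
	'\u0110': 'D', // Đ
	'\u0111': 'd', // đ
	'\u00e0': 'a', // à
	'\u00e1': 'a', // á
	'\u1ea3': 'a', // ả
	'\u00e3': 'a', // ã
	'\u1ea1': 'a', // ạ
}

// stripDiacritics replaces each rune found in baseLetters with its
// unaccented base letter and leaves all other runes unchanged.
func stripDiacritics(s string) string {
	return strings.Map(func(r rune) rune {
		if base, ok := baseLetters[r]; ok {
			return base
		}
		return r
	}, s)
}

func main() {
	fmt.Println(stripDiacritics("\u0110\u00e0o")) // Đào => Dao
}
```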

func TextToNGrams

func TextToNGrams(text string, n int) map[string]int

TextToNGrams builds a frequency map of lowercase n-grams from the input text.

func TextToWords

func TextToWords(text string) []string

TextToWords splits text into a list of words, with punctuation removed.
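Word splitting with punctuation removal can be approximated with strings.FieldsFunc. The helper splitWords is hypothetical, and its rule (anything that is not a letter or digit is a separator) is an assumption; the package may treat apostrophes or combining marks differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords is a hypothetical stand-in: it treats every rune that is
// neither a letter nor a digit as a separator, which drops punctuation.
func splitWords(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Printf("%q\n", splitWords("Hello, world! It's 2026.")) // ["Hello" "world" "It" "s" "2026"]
}
```

The apostrophe in "It's" is treated as a separator here, which shows why a production splitter usually needs language-specific rules.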

func WordsToNGrams

func WordsToNGrams(words []string, n int) map[string]int

WordsToNGrams builds a frequency map of n-grams from the input words. An n-gram is a contiguous sequence of n words.
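A sliding-window count is one straightforward way to build such a map. The helper countNGrams below is hypothetical; joining the words of an n-gram with a single space is an assumption about the key format, not something this documentation states.

```go
package main

import (
	"fmt"
	"strings"
)

// countNGrams is a hypothetical stand-in, not the package's code.
// It slides a window of n words over the input and counts each window.
func countNGrams(words []string, n int) map[string]int {
	counts := make(map[string]int)
	if n <= 0 {
		return counts
	}
	for i := 0; i+n <= len(words); i++ {
		// Assumption: words of one n-gram are joined with a single space.
		counts[strings.Join(words[i:i+n], " ")]++
	}
	return counts
}

func main() {
	words := []string{"to", "be", "or", "not", "to", "be"}
	bigrams := countNGrams(words, 2)
	fmt.Println(bigrams["to be"]) // 2
	fmt.Println(len(bigrams))     // 4 distinct bigrams
}
```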

Types

This section is empty.
