textproc

Version: v0.4.1 (latest) | Published: Apr 11, 2026 | License: BSD-2-Clause | Imports: 16 | Imported by: 2

Repository

github.com/mywrap/textproc

README

Text processing

Extract information from text and HTML.

Useful for processing data from web scrapers.

Functions

HTML content extraction: extract visible text, hyperlinks, or image URLs from a parsed HTML tree. Key functions: HTMLGetText, HTMLGetHREFs, HTMLGetImgSrc.

HTML parsing and querying: parse HTML from a string, []byte, or io.Reader; query nodes with XPath. Key functions: HTMLParseToNode, HTMLXPath.

Text cleaning: normalize Unicode representations, remove redundant whitespace, strip Vietnamese diacritics. Key functions: NormalizeText, RemoveRedundantSpace, RemoveVietnamDiacritic.

Text analysis: split text into words, build n-gram frequency maps, hash text to a 64-bit integer. Key functions: TextToWords, TextToNGrams, WordsToNGrams, HashTextToInt.

Example usage

A typical web scraping workflow:

1. Download the page.
2. Parse the downloaded page with HTMLParseToNode.
3. Extract the visible text with HTMLGetText.
4. Collect outgoing links with HTMLGetHREFs.
5. Retrieve image URLs with HTMLGetImgSrc.

See TestHtmlUtils in html_test.go for a full worked example.

For text processing (n-grams, Vietnamese diacritic removal, word splitting), see text_test.go.

Documentation

Index

Variables
func CheckValidXPath(xPath string) error
func GenRandomVarName(wordLen int) string
func GenRandomWord(minLen int, maxLen int, charList []rune) string
func HTMLGetHREFs(baseUrlStr string, node *html.Node) []string
func HTMLGetImgSrc(baseUrlStr string, imgNode *html.Node) string
func HTMLGetText(node *html.Node) string
func HTMLParseToNode(htmlContent any) *html.Node
func HTMLRender(node *html.Node) string
func HTMLRenderIndent(node *html.Node, prefix string, indent string) string
func HTMLXPath(htmlTree *html.Node, xPath string) ([]*html.Node, error)
func HashTextToInt(word string) int64
func NormalizeText(text string) string
func RemoveRedundantSpace(text string) string
func RemoveVietnamDiacritic(text string) string
func TextToNGrams(text string, n int) map[string]int
func TextToWords(text string) []string
func WordsToNGrams(words []string, n int) map[string]int

Constants

This section is empty.

Variables

var (
	AlphaNumeric = toMapRunes(numerics + lowerAlphas + upperAlphas)

	AlphaNumericList   = []rune(numerics + lowerAlphas + upperAlphas)
	AlphaNumericEnList = []rune(
		"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
	AlphaEnList = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
)

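Runes grouped by type, used for character type checking. The En variants contain only ASCII digits and letters; the other variables are built from the package's own numerics, lowerAlphas, and upperAlphas strings (Vietnamese alphabet).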

Functions

func CheckValidXPath

func CheckValidXPath(xPath string) error

CheckValidXPath returns nil if the input XPath is valid.

func GenRandomVarName

func GenRandomVarName(wordLen int) string

GenRandomVarName returns an alphanumeric string of length wordLen whose first character is a letter.
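
Requiring a leading letter is what makes the result usable as an identifier in most languages. The following is a minimal standard-library sketch of that idea, not the package's implementation: `randomVarName` and its two character sets are hypothetical names, and it is limited to ASCII.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	letters  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	alphaNum = "0123456789" + letters
)

// randomVarName is a hypothetical stand-in for GenRandomVarName. It returns
// wordLen ASCII alphanumeric characters, the first of which is always a letter.
func randomVarName(wordLen int) string {
	if wordLen <= 0 {
		return ""
	}
	buf := make([]byte, wordLen)
	buf[0] = letters[rand.Intn(len(letters))] // first character: letters only
	for i := 1; i < wordLen; i++ {
		buf[i] = alphaNum[rand.Intn(len(alphaNum))] // the rest: letters or digits
	}
	return string(buf)
}

func main() {
	fmt.Println(randomVarName(8)) // output differs on every run
}
```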

func GenRandomWord

func GenRandomWord(minLen int, maxLen int, charList []rune) string

func HTMLGetHREFs

func HTMLGetHREFs(baseUrlStr string, node *html.Node) []string

HTMLGetHREFs returns all URLs in the HTML as absolute URLs. URLs that differ only in their fragment are treated as one URL. An invalid baseUrlStr is silently ignored.
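
The resolution and fragment handling described above can be sketched with net/url alone. This is an illustration under assumptions, not the package's code: `absoluteHREFs` is a hypothetical helper that takes the href strings directly instead of an *html.Node, and its treatment of an invalid base (resolving against an empty URL) is a guess at what "silently ignored" means.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteHREFs is a hypothetical helper. It resolves each href against
// baseURLStr, drops the fragment, and removes duplicates while keeping
// first-seen order.
func absoluteHREFs(baseURLStr string, hrefs []string) []string {
	base, err := url.Parse(baseURLStr)
	if err != nil {
		base = &url.URL{} // assumption: an invalid base is replaced by an empty one
	}
	seen := make(map[string]bool)
	var out []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip hrefs that do not parse as URLs
		}
		abs := base.ResolveReference(ref)
		abs.Fragment = "" // "#install" and "#usage" now collapse to one URL
		s := abs.String()
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	links := absoluteHREFs("https://example.com/docs/", []string{
		"intro.html#install",
		"intro.html#usage",
		"/about",
	})
	for _, l := range links {
		fmt.Println(l)
	}
	// Prints:
	// https://example.com/docs/intro.html
	// https://example.com/about
}
```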

func HTMLGetImgSrc

func HTMLGetImgSrc(baseUrlStr string, imgNode *html.Node) string

HTMLGetImgSrc returns the absolute URL of the image.

func HTMLGetText

func HTMLGetText(node *html.Node) string

HTMLGetText returns all text in the HTML. The result contains neither JavaScript code nor text generated by JavaScript. The function is slow because it traverses the tree recursively and makes multiple passes over the text.

func HTMLParseToNode

func HTMLParseToNode(htmlContent any) *html.Node

HTMLParseToNode parses HTML content (string, []byte, or io.Reader) into an html.Node and returns an empty node on error. It is intended only as a testing convenience, because it silently ignores parse errors.

func HTMLRender

func HTMLRender(node *html.Node) string

HTMLRender is a convenience function that renders an HTML node to a string.

func HTMLRenderIndent

func HTMLRenderIndent(node *html.Node, prefix string, indent string) string

HTMLRenderIndent renders an HTML node to an indented string. Each element begins on a new line with "prefix" plus copies of "indent" for nesting depth. Output is XML-serialized (e.g. <br> becomes <br></br>): visual content is preserved but the node structure may differ if reparsed as HTML.

func HTMLXPath

func HTMLXPath(htmlTree *html.Node, xPath string) ([]*html.Node, error)

HTMLXPath finds all HTML nodes matching the XPath query.

func HashTextToInt

func HashTextToInt(word string) int64

HashTextToInt is a fast, low-collision hash function that maps a string to an int64.
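
The documentation does not say which hash algorithm is used. As an illustration only, the sketch below maps a string to an int64 with FNV-1a from the standard library; both the choice of FNV-1a and the name `hashTextToInt` are assumptions, so its values should not be expected to match the package's.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashTextToInt is a hypothetical stand-in for HashTextToInt. FNV-1a is an
// assumption; the package's actual algorithm is not documented.
func hashTextToInt(word string) int64 {
	h := fnv.New64a()
	h.Write([]byte(word))   // Write on a hash.Hash never returns an error
	return int64(h.Sum64()) // reinterpret the 64 bits as a signed integer
}

func main() {
	// The same input always yields the same value, so the result can be
	// used as a compact key for a word or an n-gram.
	fmt.Println(hashTextToInt("xin chào") == hashTextToInt("xin chào")) // true
	fmt.Println(hashTextToInt("xin chào"))
}
```

Because the unsigned sum is reinterpreted as signed, roughly half of all inputs produce a negative value; callers that need a non-negative key would have to mask the sign bit themselves.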

func NormalizeText

func NormalizeText(text string) string

NormalizeText normalizes different Unicode representations of the same string. For example, "é" can be a single rune ("\u00e9") or "e" followed by a combining acute accent ("e\u0301"); both should be treated as equal in text processing. Vietnamese text has an additional problem, the position of the tone mark: old style òa, óa, ỏa, õa, ọa versus new style oà, oá, oả, oã, oạ.
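
The snippet below only demonstrates the problem, not the fix: it shows that Go treats the two spellings of "é" as different strings. `runeCodes` is a hypothetical helper. The standard library has no Unicode normalizer; golang.org/x/text/unicode/norm provides one, though whether this package relies on it is not stated here.

```go
package main

import "fmt"

// runeCodes is a hypothetical helper that lists the code points of s.
func runeCodes(s string) []string {
	var codes []string
	for _, r := range s {
		codes = append(codes, fmt.Sprintf("U+%04X", r))
	}
	return codes
}

func main() {
	composed := "\u00e9"    // "é" as one precomposed rune
	decomposed := "e\u0301" // "e" followed by a combining acute accent

	// Both render as "é", yet string comparison works byte by byte.
	fmt.Println(composed == decomposed) // false
	fmt.Println(runeCodes(composed))    // [U+00E9]
	fmt.Println(runeCodes(decomposed))  // [U+0065 U+0301]
}
```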

func RemoveRedundantSpace

func RemoveRedundantSpace(text string) string

RemoveRedundantSpace replaces consecutive whitespace with a single space.
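
One common way to get this behavior with the standard library is to split on whitespace and rejoin. The sketch is not the package's code: `collapseSpaces` is a hypothetical name, and unlike the description above it also trims leading and trailing whitespace, which may differ from what RemoveRedundantSpace does.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseSpaces is a hypothetical stand-in for RemoveRedundantSpace.
// strings.Fields splits on every run of Unicode whitespace, so joining the
// fields with one space collapses each run. It also trims both ends.
func collapseSpaces(text string) string {
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("  Hello \t\n  world  ")) // "Hello world"
}
```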

func RemoveVietnamDiacritic

func RemoveVietnamDiacritic(text string) string

RemoveVietnamDiacritic strips Vietnamese diacritics from text. Example: Đào => Dao
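
A table lookup is one straightforward way to implement this kind of stripping. The sketch below is an assumption about the approach, not the package's code: `groups`, `buildTable`, and `stripDiacritics` are hypothetical names, and the table only handles precomposed characters, so decomposed input would need normalizing first.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// groups lists lowercase Vietnamese letters under the ASCII letter each one
// reduces to. All entries are precomposed code points.
var groups = map[rune]string{
	'a': "àáảãạăằắẳẵặâầấẩẫậ",
	'e': "èéẻẽẹêềếểễệ",
	'i': "ìíỉĩị",
	'o': "òóỏõọôồốổỗộơờớởỡợ",
	'u': "ùúủũụưừứửữự",
	'y': "ỳýỷỹỵ",
	'd': "đ",
}

// buildTable maps every accented rune, in both cases, to its base letter.
func buildTable() map[rune]rune {
	table := make(map[rune]rune)
	for base, variants := range groups {
		for _, v := range variants {
			table[v] = base
			table[unicode.ToUpper(v)] = unicode.ToUpper(base)
		}
	}
	return table
}

var table = buildTable()

// stripDiacritics is a hypothetical stand-in for RemoveVietnamDiacritic.
// Runes that are not in the table pass through unchanged.
func stripDiacritics(text string) string {
	return strings.Map(func(r rune) rune {
		if base, ok := table[r]; ok {
			return base
		}
		return r
	}, text)
}

func main() {
	fmt.Println(stripDiacritics("Đào")) // Dao
}
```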

func TextToNGrams

func TextToNGrams(text string, n int) map[string]int

TextToNGrams builds a map from each lowercased n-gram in the input text to its number of occurrences.

func TextToWords

func TextToWords(text string) []string

TextToWords splits text into a list of words, with punctuation removed.
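
Splitting on anything that is neither a letter nor a digit is one simple way to get words without punctuation. The sketch uses only the standard library and is not the package's implementation: `splitWords` is a hypothetical name, and its rule cuts contractions at the apostrophe, which the real function may handle differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords is a hypothetical stand-in for TextToWords. Every rune that is
// neither a letter nor a number acts as a separator and is dropped.
func splitWords(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	fmt.Printf("%q\n", splitWords("Hello, world! It's 2026."))
	// ["Hello" "world" "It" "s" "2026"]
}
```

Combining marks are not letters, so text in decomposed Unicode form would be split in the middle of accented words; normalizing the text before splitting avoids that.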

func WordsToNGrams

func WordsToNGrams(words []string, n int) map[string]int

WordsToNGrams builds a map from each n-gram in the input words to its number of occurrences. An n-gram is a contiguous sequence of n words.
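
A sliding window over the word list is enough to count n-grams. The sketch below shows the idea and is not the package's code: `wordsToNGrams` is a hypothetical reimplementation, and joining the words of an n-gram with a single space is an assumption about the key format.

```go
package main

import (
	"fmt"
	"strings"
)

// wordsToNGrams is a hypothetical stand-in for WordsToNGrams. It slides a
// window of n words over the input and counts each window's occurrences.
func wordsToNGrams(words []string, n int) map[string]int {
	counts := make(map[string]int)
	if n <= 0 {
		return counts
	}
	for i := 0; i+n <= len(words); i++ {
		key := strings.Join(words[i:i+n], " ") // assumed key format
		counts[key]++
	}
	return counts
}

func main() {
	words := strings.Fields("to be or not to be")
	fmt.Println(wordsToNGrams(words, 2))
	// map[be or:1 not to:1 or not:1 to be:2]
}
```

A list of k words yields k-n+1 windows, so an input shorter than n produces an empty map.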

Types

This section is empty.

Directories

example