Documentation
Index ¶
- Variables
- func CheckValidXPath(xPath string) error
- func GenRandomVarName(wordLen int) string
- func GenRandomWord(minLen int, maxLen int, charList []rune) string
- func HTMLGetHREFs(baseUrlStr string, node *html.Node) []string
- func HTMLGetImgSrc(baseUrlStr string, imgNode *html.Node) string
- func HTMLGetText(node *html.Node) string
- func HTMLParseToNode(htmlContent any) *html.Node
- func HTMLRender(node *html.Node) string
- func HTMLRenderIndent(node *html.Node, prefix string, indent string) string
- func HTMLXPath(htmlTree *html.Node, xPath string) ([]*html.Node, error)
- func HashTextToInt(word string) int64
- func NormalizeText(text string) string
- func RemoveRedundantSpace(text string) string
- func RemoveVietnamDiacritic(text string) string
- func TextToNGrams(text string, n int) map[string]int
- func TextToWords(text string) []string
- func WordsToNGrams(words []string, n int) map[string]int
Constants ¶
This section is empty.
Variables ¶
var (
	AlphaNumeric       = toMapRunes(numerics + lowerAlphas + upperAlphas)
	AlphaNumericList   = []rune(numerics + lowerAlphas + upperAlphas)
	AlphaNumericEnList = []rune(
		"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
	AlphaEnList = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
)
Runes grouped by type, used for character-type checking (Vietnamese alphabet). AlphaNumericEnList and AlphaEnList are restricted to ASCII characters.
Functions ¶
func CheckValidXPath ¶
CheckValidXPath returns nil if the input XPath is valid.
func GenRandomVarName ¶
GenRandomVarName returns an alphanumeric string of length wordLen whose first character is a letter.
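The rule described above (letter first, then any alphanumeric character) can be sketched as follows. This is a minimal illustration using only ASCII characters and math/rand; randomVarName, letters, and alphaNumerics are hypothetical names, and the package's actual character set and random source may differ.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical character sets for this sketch (ASCII only).
const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const alphaNumerics = "0123456789" + letters

// randomVarName returns a random string of wordLen characters.
// The first character is always a letter, so the result is usable
// as an identifier in most languages.
func randomVarName(wordLen int) string {
	if wordLen <= 0 {
		return ""
	}
	buf := make([]byte, wordLen)
	buf[0] = letters[rand.Intn(len(letters))]
	for i := 1; i < wordLen; i++ {
		buf[i] = alphaNumerics[rand.Intn(len(alphaNumerics))]
	}
	return string(buf)
}

func main() {
	fmt.Println(randomVarName(8))
}
```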
func HTMLGetHREFs ¶
HTMLGetHREFs returns all URLs in the HTML as absolute URLs. URLs that differ only in their fragment are treated as one URL. An invalid baseUrlStr is silently ignored.
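The resolution and deduplication behavior can be approximated with the standard net/url package. The sketch below works on a plain list of href strings instead of an *html.Node, and absoluteURLs is a hypothetical helper, not part of this package. Leaving hrefs unresolved when the base is invalid is an assumption about what "silently ignored" means.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURLs resolves each href against base, drops the fragment,
// and returns the distinct results in first-seen order.
func absoluteURLs(base string, hrefs []string) []string {
	baseURL, err := url.Parse(base)
	if err != nil {
		baseURL = nil // assumption: an invalid base leaves hrefs unresolved
	}
	seen := make(map[string]bool)
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip hrefs that cannot be parsed
		}
		if baseURL != nil {
			ref = baseURL.ResolveReference(ref)
		}
		ref.Fragment = "" // URLs differing only by fragment collapse to one
		s := ref.String()
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	hrefs := []string{"page.html#a", "page.html#b", "/about", "https://other.org/x"}
	for _, u := range absoluteURLs("https://example.com/docs/index.html", hrefs) {
		fmt.Println(u)
	}
}
```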
func HTMLGetImgSrc ¶
HTMLGetImgSrc returns the absolute URL of the image
func HTMLGetText ¶
HTMLGetText returns all text in the HTML. Result does not contain JavaScript code or text generated by JavaScript. This function is slow due to recursive tree traversal and multiple text passes.
func HTMLParseToNode ¶
HTMLParseToNode parses HTML content (string, []byte, or io.Reader) into an html.Node (returns an empty node on error). Should only be used for convenient testing: it ignores parse errors silently.
func HTMLRender ¶
HTMLRender is a convenience function to render an HTML node to a string
func HTMLRenderIndent ¶
HTMLRenderIndent renders an HTML node to an indented string. Each element begins on a new line with "prefix" plus copies of "indent" for nesting depth. Output is XML-serialized (e.g. <br> becomes <br></br>): visual content is preserved but the node structure may differ if reparsed as HTML.
func HashTextToInt ¶
HashTextToInt is a fast, low-collision hash function
func NormalizeText ¶
NormalizeText normalizes different Unicode representations of the same string. For example, "é" can be a single rune ("\u00e9") or "e" followed by a combining acute accent ("e\u0301"); both should be treated as equal in text processing. Vietnamese text has an additional problem, the position of the diacritic: old style is òa, óa, ỏa, õa, ọa; new style is oà, oá, oả, oã, oạ.
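The following standard-library snippet illustrates the problem only, not NormalizeText itself: the two spellings of "é" render identically but differ in rune count and bytes, so a plain comparison fails. The helper describe is hypothetical. Packages such as golang.org/x/text/unicode/norm can convert strings to a single form; whether NormalizeText relies on that package is not stated in this documentation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports how many runes and how many bytes a string occupies.
func describe(s string) (runes int, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	precomposed := "\u00e9" // single rune: LATIN SMALL LETTER E WITH ACUTE
	decomposed := "e\u0301" // "e" followed by COMBINING ACUTE ACCENT

	fmt.Println(precomposed == decomposed) // false: the byte sequences differ

	r1, b1 := describe(precomposed)
	r2, b2 := describe(decomposed)
	fmt.Println(r1, b1) // 1 2
	fmt.Println(r2, b2) // 2 3
}
```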
func RemoveRedundantSpace ¶
RemoveRedundantSpace replaces each run of consecutive whitespace with a single space.
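A minimal sketch of this kind of whitespace collapsing is shown below. collapseSpaces is a hypothetical stand-in; it treats every rune for which unicode.IsSpace is true as whitespace and does not trim the ends of the string. Whether RemoveRedundantSpace trims leading or trailing space is not documented here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces replaces every run of whitespace with one ASCII space.
// Leading and trailing runs are collapsed too, but not removed.
func collapseSpaces(text string) string {
	var b strings.Builder
	inSpace := false
	for _, r := range text {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ')
			}
			inSpace = true
			continue
		}
		inSpace = false
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("a  \t\n b   c")) // "a b c"
}
```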
func TextToNGrams ¶
TextToNGrams builds a map of lowercase n-grams from the input text.
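One way such a map could be built is sketched below. The sketch assumes that the map values are occurrence counts, that the words of an n-gram are joined with a single space, and that words are split on whitespace; none of these details are confirmed by this documentation. The names textToNGrams and wordsToNGrams are hypothetical lowercase stand-ins for the exported functions.

```go
package main

import (
	"fmt"
	"strings"
)

// wordsToNGrams counts every window of n consecutive words.
// It returns an empty map when n is not positive or exceeds len(words).
func wordsToNGrams(words []string, n int) map[string]int {
	grams := make(map[string]int)
	if n <= 0 {
		return grams
	}
	for i := 0; i+n <= len(words); i++ {
		grams[strings.Join(words[i:i+n], " ")]++
	}
	return grams
}

// textToNGrams lowercases the text, splits it on whitespace,
// and counts the resulting n-grams.
func textToNGrams(text string, n int) map[string]int {
	return wordsToNGrams(strings.Fields(strings.ToLower(text)), n)
}

func main() {
	fmt.Println(textToNGrams("The cat the cat sat", 2))
}
```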
func TextToWords ¶
TextToWords splits text into a list of words, with punctuation removed.
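Splitting on everything that is not a letter, digit, or combining mark gives a comparable result with the standard library alone. The helper splitWords below is hypothetical, and the exact definition of "punctuation" used by TextToWords may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWords cuts text at every rune that is not a letter, a number,
// or a combining mark, so punctuation and whitespace both act as
// separators and never appear in the output.
func splitWords(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r) && !unicode.IsMark(r)
	})
}

func main() {
	fmt.Println(splitWords("Hello, world! Go rocks."))
}
```

Keeping combining marks (unicode.IsMark) inside words prevents decomposed text such as "e\u0301" from being split at the accent.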
Types ¶
This section is empty.