Documentation ¶
Overview ¶
Utilities for n-grams.
Text normalization.
Porter stemming. This is a port of https://medium.com/analytics-vidhya/building-a-stemmer-492e9a128e84. As I am not familiar with the Porter stemming algorithm, some fixes were made because many of the test words failed at various stages. See https://tartarus.org/martin/PorterStemmer/def.txt for the spec.
Utilities for managing stopwords. Stopword list courtesy of https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html#fulltext-stopwords-stopwords-for-myisam-search-indexes; our version is somewhat modified.
Tokenization utilities, primarily for use as a tokenizer in a full-text search engine.
Index ¶
- Variables
- func CharCount(s string) int
- func ContainsStopwords(str string) bool
- func FilterStopwords(tokens []string) []string
- func Normalize(text string, options ...Normalizer) string
- func ReadTime(s string) int
- func RemoveStopwords(str string) string
- func Stem(s string) string
- func StemTokens(tokens []string) []string
- func Tokenize(text string, splitter Splitter, normalizer []Normalizer, options ...Tokenizer) []string
- func WordCount(s string, spaces bool) int
- func WordFrequency(s, word string) uint
- func Words(s string) []string
- type Node
- type Normalizer
- type Splitter
- type Tokenizer
- type WordOffset
Constants ¶
This section is empty.
Variables ¶
var COLORS_DICT_EMBED string
var DefaultNormalizer = []Normalizer{
	NormalizerToLower,
	NormalizePunctuation,
	NormalizeSingleSpace,
	NormalizeHyphens,
	NormalizeSpecial,
}
Default normalizer used if no normalizers are provided.
var DefaultTokenizer = []Tokenizer{
	TokenizerStopwords,
	TokenizerStemmer,
}
DefaultTokenizer removes stopwords and stems tokens.
var READING_WPM = 200
Average words read per minute
var STOPWORDS = map[string]bool{} /* 543 elements not displayed */
Functions ¶
func ContainsStopwords ¶
func FilterStopwords ¶
func Normalize ¶
func Normalize(text string, options ...Normalizer) string
Normalizes a given string using the provided options. Options are executed in the order in which they are provided. If no options are provided, DefaultNormalizer is used.
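The ordering and fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `normalize`, `toLower`, `hyphens`, `singleSpace`, and `defaultNormalizers` are hypothetical stand-ins written with the standard library only.

```go
package main

import (
	"fmt"
	"strings"
)

// Normalizer mirrors the documented shape: a function from string to string.
type Normalizer func(text string) string

// Hypothetical stand-ins for the package's normalizers.
var (
	toLower     Normalizer = strings.ToLower
	hyphens     Normalizer = func(s string) string { return strings.ReplaceAll(s, "-", " ") }
	singleSpace Normalizer = func(s string) string { return strings.Join(strings.Fields(s), " ") }
)

// defaultNormalizers is an assumed fallback list, used when no options are given.
var defaultNormalizers = []Normalizer{toLower, hyphens, singleSpace}

// normalize applies each option in the order provided.
func normalize(text string, options ...Normalizer) string {
	if len(options) == 0 {
		options = defaultNormalizers
	}
	for _, opt := range options {
		text = opt(text)
	}
	return text
}

func main() {
	fmt.Println(normalize("Full-Text   Search"))          // fallback list applied
	fmt.Println(normalize("Full-Text   Search", toLower)) // lowercasing only
}
```

Because options run in sequence, placing a hyphen replacement before whitespace collapsing matters: the spaces it introduces are then collapsed by the later step.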
func RemoveStopwords ¶
func Tokenize ¶
func Tokenize(text string, splitter Splitter, normalizer []Normalizer, options ...Tokenizer) []string
Produces a list of normalized text tokens. If no options are provided, the DefaultTokenizer is used.
func WordFrequency ¶
Types ¶
type Node ¶
type Node struct {
// Pointers to child branches
Kids map[rune]*Node
// Custom data, inserted into the last child ('*') of a string's tree.
Data interface{}
// End of branch
Done bool
// Character representing current branch
Character rune
// Internal id. 0 is the id of the root node.
Id uint32
}
Node is a node in a trie tree.
var (
	// English colors and their hex equivalents.
	Colors *Node = LoadDictionary(COLORS_DICT_EMBED, false)
)
func LoadDictionary ¶
Loads a replacer file from a given path. Files must be in the format `original,replaced value`, with each tuple on a new line. The values are loaded into memory; the replaced value is assigned to the `Data` field on the final node of a trie branch, which can be accessed using Trie.NodeAt(x).
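To make the file format concrete, here is a small sketch that parses `original,replaced value` lines. It is only an illustration of the format: `parseDictionary` is a hypothetical helper, it stores entries in a plain map rather than a trie, and the color entries are sample data.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDictionary reads "original,replaced value" tuples, one per line.
// Only the first comma separates the key from the value. Blank lines and
// lines without a comma are skipped (an assumption about error handling).
func parseDictionary(data string) map[string]string {
	entries := map[string]string{}
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		entries[parts[0]] = parts[1]
	}
	return entries
}

func main() {
	dict := parseDictionary("red,#ff0000\nnavy,#000080\n")
	fmt.Println(dict["navy"])
}
```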
func NewTrie ¶
func NewTrie() *Node
NewTrie creates a new trie. Note that this function only creates a root node.
func (*Node) At ¶
At returns the end node of the provided string. If no node exists, the second return value will be `false`.
func (*Node) Contains ¶
Contains determines whether the provided string is entirely within the trie.
func (*Node) FuzzyContains ¶
FuzzyContains checks whether the provided string is within the trie, up to a given depth. `d` is an optional depth value that controls how many characters of the provided string must be present sequentially in the trie. For example, providing (`exam`, 3) for a trie that already has `example` checks that `e`, `x`, and `a` are in the trie, each as a child of the previous character. Setting this value to -1 or to a value greater than the length of `s` is equivalent to setting it to len(s), which behaves the same as the Contains() method. Note that this method only searches for substrings at the beginning of a word.
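The depth-limited prefix check described above can be sketched with a simplified trie. The `node` type, `insert`, and `hasPrefix` below are hypothetical and are not the package's `Node` API; in particular, this sketch does not consult an end-of-branch flag when deciding a match.

```go
package main

import "fmt"

// node is a simplified, hypothetical version of the documented Node.
type node struct {
	kids map[rune]*node
	done bool // marks the end of an inserted string
}

func newNode() *node { return &node{kids: map[rune]*node{}} }

// insert adds s to the trie one rune at a time.
func (n *node) insert(s string) {
	cur := n
	for _, r := range s {
		next, ok := cur.kids[r]
		if !ok {
			next = newNode()
			cur.kids[r] = next
		}
		cur = next
	}
	cur.done = true
}

// hasPrefix reports whether the first d runes of s form a path from the root.
// A negative d, or a d larger than the rune count of s, checks the whole string.
func (n *node) hasPrefix(s string, d int) bool {
	runes := []rune(s)
	if d < 0 || d > len(runes) {
		d = len(runes)
	}
	cur := n
	for _, r := range runes[:d] {
		next, ok := cur.kids[r]
		if !ok {
			return false
		}
		cur = next
	}
	return true
}

func main() {
	root := newNode()
	root.insert("example")
	fmt.Println(root.hasPrefix("exam", 3))   // true: e, x, a are chained
	fmt.Println(root.hasPrefix("exit", 2))   // true: only e, x are checked
	fmt.Println(root.hasPrefix("exit", -1))  // false: no i under x
	fmt.Println(root.hasPrefix("ample", -1)) // false: match must start at the root
}
```

The last call illustrates the note about word beginnings: `ample` occurs inside `example`, but the walk always starts at the root.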
type Normalizer ¶
Normalizes a given string.
var (
	// Converts all characters to lowercase.
	NormalizerToLower Normalizer = func(text string) string {
		return strings.ToLower(text)
	}
	// Converts excess whitespace into a single space.
	NormalizeSingleSpace Normalizer = func(text string) string {
		return replaceSpaces(text)
	}
	// Removes most punctuation. Periods and slashes are not removed.
	NormalizePunctuation Normalizer = func(text string) string {
		return removeChars(text, ",", ";", "!", "?", "\"", "'", "(", ")", "&")
	}
	// Removes all punctuation, including periods and slashes.
	NormalizeAllPunctuation Normalizer = func(text string) string {
		return removeChars(NormalizePunctuation(text), ".", "/", "\\")
	}
	NormalizeSpecial Normalizer = func(text string) string {
		return removeChars(text, "!", "@", "#", "%", "^", "&", "*", "(", ")", "-", "_", "+", "=", "[", "]", "{", "}", ";", "'", "<", ">", "?", "~", "`")
	}
	// Replaces instances of hyphens with spaces.
	NormalizeHyphens Normalizer = func(text string) string {
		return strings.ReplaceAll(text, "-", " ")
	}
)
type Splitter ¶
Splits a string into individual tokens.
var (
	// Splits a string at non-alphanumeric characters (whitespace, punctuation, etc).
	SplitNonAlphanumeric Splitter = func(text string) []string {
		return strings.FieldsFunc(text, func(r rune) bool {
			return !unicode.IsNumber(r) && !unicode.IsLetter(r)
		})
	}
	DefaultSplitter Splitter = SplitNonAlphanumeric
)
type Tokenizer ¶
Normalizes a list of tokens.
var (
	// Removes stopwords from a token list using the default stopword list.
	TokenizerStopwords Tokenizer = func(tokens []string) []string {
		return FilterStopwords(tokens)
	}
	// Stems tokens using a Porter stemmer.
	TokenizerStemmer Tokenizer = func(tokens []string) []string {
		return StemTokens(tokens)
	}
	// Normalizes color names (e.g. `red`, `navy`) into their hex values.
	TokenizerColors Tokenizer = func(tokens []string) []string {
		for i, v := range tokens {
			if n, exists := Colors.Find(v); exists {
				tokens[i] = string(n.Data.([]byte))
			}
		}
		return tokens
	}
)
type WordOffset ¶
Occurrence of a word in a string, as well as its offset within the string. TODO: documentation on exact offset meaning.
func WordOffsets ¶
func WordOffsets(s string) []WordOffset
Returns the offsets and individual words in a string.