txt

package module

Version: v0.0.0-...-6683406 | Published: Nov 17, 2022 | License: Apache-2.0 | Imports: 4 | Imported by: 1

Repository

github.com/hvlck/txt

README ¶

txt

text utilities (golang)

txt is a collection of text normalization utilities, intended primarily for full-text search. The library is mainly for my own use, but pull requests are always welcome. The focus is on performance, since it is used in a full-text search engine I'm building.

Utilities

fast Levenshtein distance calculator (~1.3*10^12 ops/s)
an implementation of the Porter stemming algorithm
stopword remover
various reading functions
- read time
- word/char count
- (not implemented) text difficulty (Flesch-Kincaid)
a generic Trie data structure
- inserts: ~6.7*10^14 ops/s
- exact matches: ~3.3*10^15 ops/s
- partial matches: ~1.2*10^14 ops/s
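
For readers unfamiliar with the Levenshtein distance mentioned in the list above, the following is a minimal, self-contained sketch of the classic two-row dynamic-programming approach. It is not the implementation in the `lev` subpackage (whose API is not shown on this page); the names `levenshtein` and `min3` are hypothetical.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, counted in runes.
// Hypothetical helper for illustration; not the lev package's API.
// Only two rows of the dynamic-programming table are kept in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(levenshtein("flaw", "lawn"))      // 2
}
```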

Roadmap

text normalisation
- color names (words -> hex)
- URL normalizations
- remove fractional numbers to nth precision
- synonyms/thesaurus
  - replace multiple words with a single synonym
text difficulty
support for multiple languages

Credits

Documentation ¶

Overview ¶

Utilities for n-grams.

Text normalization.

Porter stemming. This is a port of https://medium.com/analytics-vidhya/building-a-stemmer-492e9a128e84, as I am not familiar with the Porter stemming algorithm. Some fixes were made, as many of the test words failed at various stages. See https://tartarus.org/martin/PorterStemmer/def.txt for the spec.

Utilities for managing stopwords. Stopword list courtesy of https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html#fulltext-stopwords-stopwords-for-myisam-search-indexes; our version is somewhat modified.

Tokenization utilities, primarily for use as a tokenizer in a full-text search engine.

Index ¶

Variables
func CharCount(s string) int
func ContainsStopwords(str string) bool
func FilterStopwords(tokens []string) []string
func Normalize(text string, options ...Normalizer) string
func ReadTime(s string) int
func RemoveStopwords(str string) string
func Stem(s string) string
func StemTokens(tokens []string) []string
func Tokenize(text string, splitter Splitter, normalizer []Normalizer, options ...Tokenizer) []string
func WordCount(s string, spaces bool) int
func WordFrequency(s, word string) uint
func Words(s string) []string
type Node
- func LoadDictionary(data string, caseSensitive bool) *Node
- func NewTrie() *Node
- func (n *Node) At(s string) (*Node, bool)
- func (n *Node) Contains(s string) bool
- func (n *Node) Delete(words ...string) bool
- func (n *Node) FuzzyContains(s string, d int) bool
- func (n *Node) Insert(s string, data interface{})
- func (n *Node) String() string
- func (n *Node) Words() []string
type Normalizer
type Splitter
type Tokenizer
type WordOffset
- func WordOffsets(s string) []WordOffset

Constants ¶

This section is empty.

Variables ¶

var COLORS_DICT_EMBED string

var DefaultNormalizer = []Normalizer{
	NormalizerToLower,
	NormalizePunctuation,
	NormalizeSingleSpace,
	NormalizeHyphens,
	NormalizeSpecial,
}

Default normalizer used if no normalizers are provided.

var DefaultTokenizer = []Tokenizer{
	TokenizerStopwords,
	TokenizerStemmer,
}

DefaultTokenizer removes stopwords and stems tokens.

var READING_WPM = 200

Average words read per minute

var STOPWORDS = map[string]bool{} /* 543 elements not displayed */

Functions ¶

func CharCount ¶

func CharCount(s string) int

func ContainsStopwords ¶

func ContainsStopwords(str string) bool

func FilterStopwords ¶

func FilterStopwords(tokens []string) []string

func Normalize ¶

func Normalize(text string, options ...Normalizer) string

Normalizes a given string using the provided options. Options are executed in the order that they're provided in. If no options are provided, the default normalizer is used.
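
The ordering and fallback behavior described above can be sketched as follows. This is a self-contained illustration, not the package's source: `applyNormalizers`, `toLower`, `hyphensToSpaces`, `singleSpace`, and `defaults` are hypothetical names, and the defaults are passed in explicitly so the example runs without the library.

```go
package main

import (
	"fmt"
	"strings"
)

// Normalizer mirrors the documented type: func(text string) string.
type Normalizer func(text string) string

// Illustrative normalizers (hypothetical stand-ins for the package's own).
var (
	toLower         Normalizer = strings.ToLower
	hyphensToSpaces Normalizer = func(s string) string { return strings.ReplaceAll(s, "-", " ") }
	singleSpace     Normalizer = func(s string) string { return strings.Join(strings.Fields(s), " ") }

	defaults = []Normalizer{toLower, hyphensToSpaces, singleSpace}
)

// applyNormalizers runs each option in the order given and falls back to
// the supplied defaults when no options are passed.
func applyNormalizers(text string, defaults []Normalizer, options ...Normalizer) string {
	if len(options) == 0 {
		options = defaults
	}
	for _, n := range options {
		text = n(text)
	}
	return text
}

func main() {
	// No options: the defaults run in order.
	fmt.Printf("%q\n", applyNormalizers("Full-Text   Search", defaults)) // "full text search"
	// One explicit option: only that normalizer runs.
	fmt.Printf("%q\n", applyNormalizers("Full-Text   Search", defaults, toLower)) // "full-text   search"
}
```

Because options run in order, placing a hyphen replacement before whitespace collapsing matters: the reverse order could leave doubled spaces behind.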

func ReadTime ¶

func ReadTime(s string) int

Duration, in seconds, to read the provided string `s`.
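
The page does not show how ReadTime computes its result. One plausible reading, assuming whitespace-separated words and the documented READING_WPM of 200, is sketched below; `readTimeSeconds` and `readingWPM` are hypothetical names, and the rounding behavior (integer division, rounding down) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// readingWPM mirrors the documented READING_WPM default of 200.
const readingWPM = 200

// readTimeSeconds is a hypothetical estimate: the number of
// whitespace-separated words, converted to whole seconds at readingWPM
// words per minute (rounded down).
func readTimeSeconds(s string) int {
	words := len(strings.Fields(s))
	return words * 60 / readingWPM
}

func main() {
	text := strings.Repeat("word ", 400)
	fmt.Println(readTimeSeconds(text)) // 400 words at 200 wpm: 120
}
```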

func RemoveStopwords ¶

func RemoveStopwords(str string) string

func Stem ¶

func Stem(s string) string

Stems a given token. Note that this should only be used with a single token.

func StemTokens ¶

func StemTokens(tokens []string) []string

Stems an array of words/tokens.

func Tokenize ¶

func Tokenize(text string, splitter Splitter, normalizer []Normalizer, options ...Tokenizer) []string

Produces a list of normalized text tokens. If no options are provided, the DefaultTokenizer is used.
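
A tokenization pipeline of this shape can be sketched as normalize, then split, then post-process the tokens. The order of those stages is an assumption, not confirmed by this page, and `tokenize`, `lower`, `splitNonAlnum`, and `dropStopwords` are hypothetical names; the three-word stopword set is illustrative and is not the package's list.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// These mirror the documented function types.
type (
	Normalizer func(text string) string
	Splitter   func(text string) []string
	Tokenizer  func(tokens []string) []string
)

// lower is an illustrative normalizer.
var lower Normalizer = strings.ToLower

// tokenize is a hypothetical pipeline: normalize the text, split it into
// tokens, then run each token-level step in order.
func tokenize(text string, split Splitter, norm []Normalizer, opts ...Tokenizer) []string {
	for _, n := range norm {
		text = n(text)
	}
	tokens := split(text)
	for _, t := range opts {
		tokens = t(tokens)
	}
	return tokens
}

// splitNonAlnum splits at any rune that is neither a letter nor a number.
func splitNonAlnum(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsNumber(r) && !unicode.IsLetter(r)
	})
}

// dropStopwords filters tokens against a tiny illustrative stopword set.
func dropStopwords(tokens []string) []string {
	stop := map[string]bool{"the": true, "a": true, "of": true}
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	got := tokenize("The Art of Search", splitNonAlnum, []Normalizer{lower}, dropStopwords)
	fmt.Println(got) // [art search]
}
```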

func WordCount ¶

func WordCount(s string, spaces bool) int

Counts the number of words within a string.

func WordFrequency ¶

func WordFrequency(s, word string) uint

func Words ¶

func Words(s string) []string

Returns the individual words in a string, in lowercase. Also removes punctuation (".", ",", ":", ";", "!", "?"). Parentheses are kept, as well as brackets and quotation marks.
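
The behavior described above can be approximated with the standard library alone. This sketch assumes words are separated by whitespace, which the documentation does not state; `words` is a hypothetical stand-in, not the package function.

```go
package main

import (
	"fmt"
	"strings"
)

// punctuation strips the six characters listed in the documentation.
var punctuation = strings.NewReplacer(".", "", ",", "", ":", "", ";", "", "!", "", "?", "")

// words lowercases s, removes the listed punctuation, and splits on
// whitespace. Parentheses, brackets, and quotation marks are left alone.
func words(s string) []string {
	return strings.Fields(punctuation.Replace(strings.ToLower(s)))
}

func main() {
	fmt.Printf("%q\n", words("Hello, world! (Really?)")) // ["hello" "world" "(really)"]
}
```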

Types ¶

type Node ¶

type Node struct {
	// Pointers to child branches
	Kids map[rune]*Node
	// Custom data, inserted into the last child ('*') of a string's tree.
	Data interface{}
	// End of branch
	Done bool
	// Character representing current branch
	Character rune
	// Internal id. 0 is the id of the root node.
	Id uint32
}

Node is a node in a trie tree.

var (
	// English colors and their hex equivalents.
	Colors *Node = LoadDictionary(COLORS_DICT_EMBED, false)
)

func LoadDictionary ¶

func LoadDictionary(data string, caseSensitive bool) *Node

Loads replacer data from the provided string. The data must be in the format: original,replaced value with each tuple on a new line. The values are loaded into memory; the replaced value is assigned to the `Data` field on the final node of a trie branch, which can be accessed using At(x).
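
The `original,replaced value` line format can be illustrated with a small parser. To stay self-contained, this sketch stores the pairs in a map rather than a trie; `parseDictionary` is a hypothetical name, the sample color entries are made up for the example, and lowercasing keys when `caseSensitive` is false is an assumption about what that flag means.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDictionary reads "original,replaced" pairs, one per line.
// Lines without a comma (including blank lines) are skipped.
func parseDictionary(data string, caseSensitive bool) map[string]string {
	out := make(map[string]string)
	for _, line := range strings.Split(data, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), ",")
		if !ok {
			continue
		}
		if !caseSensitive {
			key = strings.ToLower(key) // assumption: keys are folded to lowercase
		}
		out[key] = value
	}
	return out
}

func main() {
	// Illustrative data only; not the package's embedded color dictionary.
	dict := parseDictionary("Red,#ff0000\nnavy,#000080\n", false)
	fmt.Println(dict["red"], dict["navy"]) // #ff0000 #000080
}
```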

func NewTrie ¶

func NewTrie() *Node

NewTrie creates a new trie. Note that this function only creates a root node.

func (*Node) At ¶

func (n *Node) At(s string) (*Node, bool)

At returns the node at the end of the provided string. If no such node exists, the second return value will be `false`.

func (*Node) Contains ¶

func (n *Node) Contains(s string) bool

Contains determines whether the provided string is entirely within the trie.

func (*Node) Delete ¶

func (n *Node) Delete(words ...string) bool

Delete removes words from a trie.

func (*Node) FuzzyContains ¶

func (n *Node) FuzzyContains(s string, d int) bool

FuzzyContains checks whether the beginning of the provided string is within the trie. `d` is a depth value that controls how many characters of the provided string must be present sequentially in the trie. For example, providing (`exam`, 3) for a trie that already has `example` will check that `e`, `x`, and `a` are in the trie, each as a child of the previous character. Setting this value to -1 or to a value greater than the length of `s` is equivalent to setting it to len(s), which in turn is equivalent to calling the Contains() method. Note that this method only searches for substrings at the beginning of a word.

func (*Node) Insert ¶

func (n *Node) Insert(s string, data interface{})

Inserts a word into a trie.

func (*Node) String ¶

func (n *Node) String() string

func (*Node) Words ¶

func (n *Node) Words() []string

Returns a list of all words stored in the trie.

type Normalizer ¶

type Normalizer func(text string) string

Normalizes a given string.

var (
	// Converts all characters to lowercase.
	NormalizerToLower Normalizer = func(text string) string { return strings.ToLower(text) }

	// Converts excess whitespace into a single space.
	NormalizeSingleSpace Normalizer = func(text string) string {
		return replaceSpaces(text)
	}

	// Removes most punctuation. Periods and slashes are not removed.
	NormalizePunctuation Normalizer = func(text string) string {
		return removeChars(text, ",", ";", "!", "?", "\"", "'", "(", ")", "&")
	}

	// Removes all punctuation, including periods and slashes.
	NormalizeAllPunctuation Normalizer = func(text string) string {
		return removeChars(NormalizePunctuation(text), ".", "/", "\\")
	}

	NormalizeSpecial Normalizer = func(text string) string {
		return removeChars(text, "!", "@", "#", "%", "^", "&", "*", "(", ")", "-", "_", "+", "=", "[", "]", "{", "}", ";", "'", "<", ">", "?", "~", "`")
	}

	// Replaces instances of hyphens with spaces.
	NormalizeHyphens Normalizer = func(text string) string {
		return strings.ReplaceAll(text, "-", " ")
	}
)

type Splitter ¶

type Splitter func(text string) []string

Splits a string into individual tokens.

var (
	// Splits a string at non-alphanumeric characters (whitespace, punctuation, etc).
	SplitNonAlphanumeric Splitter = func(text string) []string {
		return strings.FieldsFunc(text, func(r rune) bool {
			return !unicode.IsNumber(r) && !unicode.IsLetter(r)
		})
	}

	DefaultSplitter Splitter = SplitNonAlphanumeric
)
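
To make the splitting rule concrete, the following standalone program reuses the same `strings.FieldsFunc` predicate shown above under a local, hypothetical name (`splitNonAlphanumeric`). Note that apostrophes and periods are split points, so contractions and version numbers are broken apart, while accented letters are kept.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitNonAlphanumeric splits at every rune that is neither a letter nor
// a number, using the same predicate as the documented splitter.
func splitNonAlphanumeric(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsNumber(r) && !unicode.IsLetter(r)
	})
}

func main() {
	fmt.Printf("%q\n", splitNonAlphanumeric("don't stop: v2.0 café"))
	// ["don" "t" "stop" "v2" "0" "café"]
}
```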

type Tokenizer ¶

type Tokenizer func(tokens []string) []string

Normalizes a list of tokens.

var (
	// Removes stopwords from a token list using the default stopword list.
	TokenizerStopwords Tokenizer = func(tokens []string) []string { return FilterStopwords(tokens) }
	// Stems tokens using a Porter stemmer.
	TokenizerStemmer Tokenizer = func(tokens []string) []string { return StemTokens(tokens) }

	// Normalizes color names (e.g. `red`, `navy`) into their hex values.
	TokenizerColors Tokenizer = func(tokens []string) []string {
		for i, v := range tokens {
			if n, exists := Colors.At(v); exists {
				tokens[i] = string(n.Data.([]byte))
			}
		}

		return tokens
	}
)

type WordOffset ¶

type WordOffset struct {
	Word   string
	Offset int
}

Occurrence of a word in a string, along with its offset within the string. TODO: document the exact meaning of the offset.

func WordOffsets ¶

func WordOffsets(s string) []WordOffset

Returns the offsets and individual words in a string.
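
Since the exact meaning of the offset is not documented, the sketch below picks one interpretation and labels it as such: the offset is the byte index at which each whitespace-separated word starts. The names `wordOffset` and `wordOffsets` are hypothetical and lowercase to distinguish them from the package's own.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordOffset pairs a word with the byte index where it starts (assumption:
// the real package may measure offsets differently).
type wordOffset struct {
	Word   string
	Offset int
}

// wordOffsets scans s once and records each whitespace-separated word
// together with its starting byte offset.
func wordOffsets(s string) []wordOffset {
	var out []wordOffset
	start := -1 // -1 means "not currently inside a word"
	for i, r := range s {
		if unicode.IsSpace(r) {
			if start >= 0 {
				out = append(out, wordOffset{Word: s[start:i], Offset: start})
				start = -1
			}
			continue
		}
		if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		out = append(out, wordOffset{Word: s[start:], Offset: start})
	}
	return out
}

func main() {
	fmt.Println(wordOffsets("go is fun")) // [{go 0} {is 3} {fun 6}]
}
```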

Directories ¶

Path	Synopsis
lev	Levenshtein distance calculator
