txt

v0.0.0-...-6683406
Published: Nov 17, 2022 License: Apache-2.0 Imports: 4 Imported by: 1

README

txt

text utilities (golang)

txt is a collection of text normalization utilities, mainly for full-text search. The library is intended for my own use, but pull requests are always welcome. The focus is on performance, for use in a full-text search engine I'm building.

Utilities

  • fast Levenshtein distance calculator (~1.3*10^12 ops/s)
  • an implementation of the Porter stemming algorithm
  • stopword remover
  • various reading functions
    • read time
    • word/char count
    • (not implemented) text difficulty (Flesch-Kincaid)
  • a generic Trie data structure
    • inserts: ~6.7*10^14 ops/s
    • exact matches: ~3.3*10^15 ops/s
    • partial matches: ~1.2*10^14 ops/s
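
For reference, the Levenshtein distance listed above can be computed with the classic two-row dynamic-programming approach. The following is a minimal sketch only, not the optimized calculator shipped with this module; the names `levenshtein` and `min3` are hypothetical.

```go
package main

import "fmt"

// levenshtein is a hypothetical helper, not this package's API.
// It returns the number of single-rune insertions, deletions and
// substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	// prev is the previous row of the DP table, curr the row being filled.
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
}
```

Keeping only two rows holds memory at O(len(b)) instead of a full table, which is a common first step before further optimization.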

Roadmap

  • text normalisation
    • color names (words -> hex)
    • URL normalizations
    • remove fractional numbers to nth precision
    • synonyms/thesaurus
      • replace multiple words with a single word that is a synonym of them
  • text difficulty
  • support for multiple languages

Credits

Documentation

Overview

Utilities for n-grams.

Text normalization.

Porter stemming. This is a port of https://medium.com/analytics-vidhya/building-a-stemmer-492e9a128e84, as I am not familiar with the Porter stemming algorithm. Some fixes were made, as many of the test words failed at various stages. See https://tartarus.org/martin/PorterStemmer/def.txt for the spec.

Utilities for managing stopwords. The stopword list is courtesy of https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html#fulltext-stopwords-stopwords-for-myisam-search-indexes; our version is somewhat modified.

Tokenization utilities, primarily for use as a tokenizer in a full-text search engine.

Index

Constants

This section is empty.

Variables

var COLORS_DICT_EMBED string

Default normalizer used if no normalizers are provided.

DefaultTokenizer removes stopwords and stems tokens.

var READING_WPM = 200

Average words read per minute

var STOPWORDS = map[string]bool{} /* 543 elements not displayed */

Functions

func CharCount

func CharCount(s string) int

func ContainsStopwords

func ContainsStopwords(str string) bool

func FilterStopwords

func FilterStopwords(tokens []string) []string

func Normalize

func Normalize(text string, options ...Normalizer) string

Normalizes a given string using the provided options. Options are executed in the order that they're provided in. If no options are provided, the default normalizer is used.
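
The in-order behaviour described above can be illustrated with a small stand-alone sketch. It re-declares the `Normalizer` type locally; `applyNormalizers`, `toLower` and `hyphensToSpaces` are hypothetical names, and unlike Normalize the sketch returns the text unchanged when no options are given, since the contents of the default normalizer are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// Normalizer mirrors the type documented in this package.
type Normalizer func(text string) string

// Two illustrative normalizers (hypothetical names).
var (
	toLower         Normalizer = strings.ToLower
	hyphensToSpaces Normalizer = func(text string) string {
		return strings.ReplaceAll(text, "-", " ")
	}
)

// applyNormalizers is a hypothetical stand-in for Normalize: it feeds
// the output of each option into the next, in the order provided.
// Assumption: with no options the text is returned as-is.
func applyNormalizers(text string, options ...Normalizer) string {
	for _, normalize := range options {
		text = normalize(text)
	}
	return text
}

func main() {
	fmt.Println(applyNormalizers("Full-Text SEARCH", toLower, hyphensToSpaces)) // full text search
}
```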

func ReadTime

func ReadTime(s string) int

Duration, in seconds, to read the provided string `s`.
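
As a rough illustration of how such an estimate could be derived from a words-per-minute rate, the sketch below divides the word count by the documented default of 200 WPM. The formula, the whitespace-based word count and the name `estimateReadSeconds` are assumptions; ReadTime itself may count words or round differently.

```go
package main

import (
	"fmt"
	"strings"
)

// readingWPM mirrors the documented READING_WPM default of 200.
const readingWPM = 200

// estimateReadSeconds is a hypothetical helper, not this package's API.
// Assumption: words are whitespace-separated and the result is
// words * 60 / WPM, truncated by integer division.
func estimateReadSeconds(s string) int {
	words := len(strings.Fields(s))
	return words * 60 / readingWPM
}

func main() {
	text := strings.Repeat("word ", 400)
	fmt.Println(estimateReadSeconds(text)) // 120
}
```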

func RemoveStopwords

func RemoveStopwords(str string) string

func Stem

func Stem(s string) string

Stems a given token. Note that this should only be used with a single token.

func StemTokens

func StemTokens(tokens []string) []string

Stems an array of words/tokens.

func Tokenize

func Tokenize(text string, splitter Splitter, normalizer []Normalizer, options ...Tokenizer) []string

Produces a list of normalized text tokens. If no options are provided, the DefaultTokenizer is used.
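
One plausible reading of this signature is a three-stage pipeline: normalize the text, split it, then run each Tokenizer over the token list. The sketch below follows that reading with locally re-declared types; the stage order, `tokenizeSketch`, `lower` and `dropTinyStopwords` are assumptions for illustration, and the DefaultTokenizer fallback is omitted.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Local copies of the function types documented in this package.
type (
	Normalizer func(text string) string
	Splitter   func(text string) []string
	Tokenizer  func(tokens []string) []string
)

// lower is an illustrative normalizer (hypothetical name).
var lower Normalizer = strings.ToLower

// tokenizeSketch is a hypothetical stand-in for Tokenize.
// Assumed order: normalizers, then the splitter, then the tokenizers.
func tokenizeSketch(text string, split Splitter, normalizers []Normalizer, options ...Tokenizer) []string {
	for _, n := range normalizers {
		text = n(text)
	}
	tokens := split(text)
	for _, t := range options {
		tokens = t(tokens)
	}
	return tokens
}

// splitNonAlphanumeric has the same body as the SplitNonAlphanumeric splitter.
func splitNonAlphanumeric(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsNumber(r) && !unicode.IsLetter(r)
	})
}

// dropTinyStopwords is a made-up Tokenizer with a three-word stopword list.
func dropTinyStopwords(tokens []string) []string {
	stop := map[string]bool{"the": true, "a": true, "of": true}
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if !stop[tok] {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	tokens := tokenizeSketch("The quick-brown fox", splitNonAlphanumeric,
		[]Normalizer{lower}, dropTinyStopwords)
	fmt.Println(tokens) // [quick brown fox]
}
```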

func WordCount

func WordCount(s string, spaces bool) int

Counts the number of words within a string.

func WordFrequency

func WordFrequency(s, word string) uint

func Words

func Words(s string) []string

Returns the individual words in a string, in lowercase. Also removes punctuation (".", ",", ":", ";", "!", "?"). Parentheses are kept, as well as brackets and quotation marks.

Types

type Node

type Node struct {
	// Pointers to child branches
	Kids map[rune]*Node
	// Custom data, inserted into the last child ('*') of a string's tree.
	Data interface{}
	// End of branch
	Done bool
	// Character representing current branch
	Character rune
	// Internal id. 0 is the id of the root node.
	Id uint32
}

Node is a node in a trie.

var (
	// English colors and their hex equivalents.
	Colors *Node = LoadDictionary(COLORS_DICT_EMBED, false)
)

func LoadDictionary

func LoadDictionary(data string, caseSensitive bool) *Node

Loads a replacer dictionary from the provided data. Entries must be in the format `original,replaced value`, with each tuple on a new line. The values are loaded into memory; the replaced value is assigned to the `Data` field on the final node of a trie branch, which can be accessed using Node.At(x).
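
To make the `original,replaced value` format concrete, here is a small sketch that parses such data into a plain map rather than a trie. `parseDictionary` is a hypothetical name, and lowercasing keys when `caseSensitive` is false is an assumption about what that flag means.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDictionary is a hypothetical helper, not this package's API.
// It reads one "original,replaced value" tuple per line.
func parseDictionary(data string, caseSensitive bool) map[string]string {
	entries := make(map[string]string)
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		// Split at the first comma only, so the replacement may contain commas.
		original, replaced, ok := strings.Cut(line, ",")
		if !ok {
			continue // skip blank or malformed lines
		}
		if !caseSensitive {
			// Assumption: case-insensitive dictionaries store lowercase keys.
			original = strings.ToLower(original)
		}
		entries[original] = replaced
	}
	return entries
}

func main() {
	entries := parseDictionary("red,#ff0000\nNavy,#000080\n", false)
	fmt.Println(entries["navy"]) // #000080
}
```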

func NewTrie

func NewTrie() *Node

NewTrie creates a new trie. Note that this function only creates a root node.

func (*Node) At

func (n *Node) At(s string) (*Node, bool)

At returns the end node of the provided string. If no such node exists, the second return value is `false`.

func (*Node) Contains

func (n *Node) Contains(s string) bool

Contains determines whether the provided string is entirely within the trie.

func (*Node) Delete

func (n *Node) Delete(words ...string) bool

Delete removes words from a trie.

func (*Node) FuzzyContains

func (n *Node) FuzzyContains(s string, d int) bool

FuzzyContains checks whether the provided string is within the trie. `d` is a depth value that controls how many characters of the provided string must be present sequentially in the trie. For example, providing (`exam`, 3) for a trie that already has `example` checks that `e`, `x`, and `a` are in the trie, each as a child of the previous character. Setting this value to -1 or to a value greater than the length of `s` is equivalent to setting it to len(s), and to calling the Contains() method. Note that this method only searches for substrings at the beginning of a word.
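
The depth-limited prefix walk described above can be sketched on a stripped-down trie. `trieNode`, `insert` and `prefixContains` are hypothetical names; the sketch only checks that the path exists and does not consult an end-of-word flag, which may differ from how the real method treats the full-depth case.

```go
package main

import "fmt"

// trieNode is a simplified, hypothetical version of Node.
type trieNode struct {
	kids map[rune]*trieNode
	done bool // marks the end of an inserted word
}

func newTrieNode() *trieNode {
	return &trieNode{kids: map[rune]*trieNode{}}
}

// insert adds s to the trie one rune at a time.
func (n *trieNode) insert(s string) {
	cur := n
	for _, r := range s {
		next, ok := cur.kids[r]
		if !ok {
			next = newTrieNode()
			cur.kids[r] = next
		}
		cur = next
	}
	cur.done = true
}

// prefixContains reports whether the first d runes of s form a path
// from the root. A negative d, or one longer than s, means all of s.
func (n *trieNode) prefixContains(s string, d int) bool {
	runes := []rune(s)
	if d < 0 || d > len(runes) {
		d = len(runes)
	}
	cur := n
	for _, r := range runes[:d] {
		next, ok := cur.kids[r]
		if !ok {
			return false
		}
		cur = next
	}
	return true
}

func main() {
	root := newTrieNode()
	root.insert("example")
	fmt.Println(root.prefixContains("exam", 3)) // true
	fmt.Println(root.prefixContains("exit", 3)) // false
}
```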

func (*Node) Insert

func (n *Node) Insert(s string, data interface{})

Inserts a word into a trie.

func (*Node) String

func (n *Node) String() string

func (*Node) Words

func (n *Node) Words() []string

Returns a list of all words stored in the trie.

type Normalizer

type Normalizer func(text string) string

Normalizes a given string.

var (
	// Converts all characters to lowercase.
	NormalizerToLower Normalizer = func(text string) string { return strings.ToLower(text) }

	// Converts excess whitespace into a single space.
	NormalizeSingleSpace Normalizer = func(text string) string {
		return replaceSpaces(text)
	}

	// Removes most punctuation. Periods and slashes are not removed.
	NormalizePunctuation Normalizer = func(text string) string {
		return removeChars(text, ",", ";", "!", "?", "\"", "'", "(", ")", "&")
	}

	// Removes all punctuation, including periods and slashes.
	NormalizeAllPunctuation Normalizer = func(text string) string {
		return removeChars(NormalizePunctuation(text), ".", "/", "\\")
	}

	NormalizeSpecial Normalizer = func(text string) string {
		return removeChars(text, "!", "@", "#", "%", "^", "&", "*", "(", ")", "-", "_", "+", "=", "[", "]", "{", "}", ";", "'", "<", ">", "?", "~", "`")
	}

	// Replaces instances of hyphens with spaces.
	NormalizeHyphens Normalizer = func(text string) string {
		return strings.ReplaceAll(text, "-", " ")
	}
)

type Splitter

type Splitter func(text string) []string

Splits a string into individual tokens.

var (
	// Splits a string at non-alphanumeric characters (whitespace, punctuation, etc).
	SplitNonAlphanumeric Splitter = func(text string) []string {
		return strings.FieldsFunc(text, func(r rune) bool {
			return !unicode.IsNumber(r) && !unicode.IsLetter(r)
		})
	}

	DefaultSplitter Splitter = SplitNonAlphanumeric
)

type Tokenizer

type Tokenizer func(tokens []string) []string

Normalizes a list of tokens.

var (
	// Removes stopwords from a token list using the default stopword list.
	TokenizerStopwords Tokenizer = func(tokens []string) []string { return FilterStopwords(tokens) }
	// Stems tokens using a Porter stemmer.
	TokenizerStemmer Tokenizer = func(tokens []string) []string { return StemTokens(tokens) }

	// Normalizes color names (e.g. `red`, `navy`) into their hex values.
	TokenizerColors Tokenizer = func(tokens []string) []string {
		for i, v := range tokens {
			if n, exists := Colors.Find(v); exists {
				tokens[i] = string(n.Data.([]byte))
			}
		}

		return tokens
	}
)

type WordOffset

type WordOffset struct {
	Word   string
	Offset int
}

Occurrence of a word in a string, as well as its offset within the string. TODO: document the exact meaning of the offset.

func WordOffsets

func WordOffsets(s string) []WordOffset

Returns the offsets and individual words in a string.

Directories

Path Synopsis
Levenshtein distance calculator
