Documentation ¶
Overview ¶
This file was taken directly from: https://github.com/lytics/multibayes/blob/master/stopbytes.go
Please refer to lytics' multibayes package for further information and credit.
Index ¶
- Constants
- func InverseDocumentFrequencies(documents [][]string, weighting weightingScheme) map[string]float64
- func InverseDocumentFrequency(term string, stem bool, documents [][]string, weighting weightingScheme) float64
- func TermFrequencies(compareDoc []string, documents [][]string) []float64
- func TermFrequency(term string, stem bool, document []string, weighting weightingScheme) float64
- func TokenizeDocument(document string) []string
Constants ¶
const (
	// Term frequency weightings:
	// * Binary weighting.
	TermWeightingBinary weightingScheme = 0
	// * Raw frequency weighting.
	TermWeightingRaw weightingScheme = 1
	// * Log normalization weighting.
	TermWeightingLog weightingScheme = 2
	// * Double normalization 0.5 weighting.
	TermWeightingDoubleHalf weightingScheme = 3
	// * Double normalization K weighting.
	TermWeightingDoubleK weightingScheme = 4

	// Inverse document frequency weightings:
	// * Unary weighting.
	InvDocWeightingUnary weightingScheme = 0
	// * Log weighting.
	InvDocWeightingLog weightingScheme = 1
	// * Log smooth weighting.
	InvDocWeightingLogSmooth weightingScheme = 2
	// * Log maximum weighting.
	InvDocWeightingLogMax weightingScheme = 3
	// * Probabilistic weighting.
	InvDocWeightingProb weightingScheme = 4
)
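The scheme names above match the usual textbook tf-idf variants. For orientation, one common formulation of each is given below. This is an assumption based on the names alone: the page does not show which exact formulas the package implements, and variants exist (especially for the smoothed forms).

```latex
% f_{t,d}: raw count of term t in document d
% N: number of documents; n_t: number of documents containing t
% K: a constant in [0, 1]
\begin{align*}
\text{binary:}                   \quad & \mathrm{tf}(t,d) = \begin{cases} 1 & f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} \\
\text{raw:}                      \quad & \mathrm{tf}(t,d) = f_{t,d} \\
\text{log normalization:}        \quad & \mathrm{tf}(t,d) = \log\left(1 + f_{t,d}\right) \\
\text{double normalization 0.5:} \quad & \mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}} \\
\text{double normalization K:}   \quad & \mathrm{tf}(t,d) = K + (1 - K) \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}} \\[1ex]
\text{unary:}                    \quad & \mathrm{idf}(t) = 1 \\
\text{log:}                      \quad & \mathrm{idf}(t) = \log\frac{N}{n_t} \\
\text{log smooth:}               \quad & \mathrm{idf}(t) = \log\frac{N}{1 + n_t} + 1 \\
\text{log maximum:}              \quad & \mathrm{idf}(t) = \log\frac{\max_{t' \in d} n_{t'}}{1 + n_t} \\
\text{probabilistic:}            \quad & \mathrm{idf}(t) = \log\frac{N - n_t}{n_t}
\end{align*}
```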
Functions ¶
func InverseDocumentFrequencies ¶
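func InverseDocumentFrequencies(documents [][]string, weighting weightingScheme) map[string]float64
InverseDocumentFrequencies is a wrapper function that returns the inverse document frequency vector for all terms in the supplied corpus (i.e. all tokenized documents) as a map[string]float64.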
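A minimal sketch of how such a map could be built in one pass over the corpus, assuming plain log weighting, log(N / n_t). The name idfMap is hypothetical and this is not the package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// idfMap (hypothetical name) returns log(N / n_t) for every distinct term,
// where N is the number of documents and n_t the number containing the term.
func idfMap(documents [][]string) map[string]float64 {
	docCount := make(map[string]int)
	for _, doc := range documents {
		// seen ensures a term is counted at most once per document.
		seen := make(map[string]bool)
		for _, tok := range doc {
			if !seen[tok] {
				seen[tok] = true
				docCount[tok]++
			}
		}
	}

	total := float64(len(documents))
	idf := make(map[string]float64, len(docCount))
	for term, n := range docCount {
		idf[term] = math.Log(total / float64(n))
	}
	return idf
}

func main() {
	corpus := [][]string{{"cat", "sat"}, {"cat", "ran"}}
	idf := idfMap(corpus)
	fmt.Println(len(idf))   // 3
	fmt.Println(idf["cat"]) // 0, because "cat" occurs in every document
}
```

Every key in the result comes from the corpus itself, so n_t is never zero and no division-by-zero guard is needed here.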
func InverseDocumentFrequency ¶
func InverseDocumentFrequency(term string, stem bool, documents [][]string, weighting weightingScheme) float64
InverseDocumentFrequency takes a term, stems it if stem is set, and counts its appearances across the supplied set of already tokenized documents. The resulting value is then adjusted according to the supplied weighting scheme.
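As a rough illustration of the counting step, the sketch below counts the documents that contain a term and applies plain log weighting. The helper names documentCount and logIDF are hypothetical, stemming is left out, and returning 0 for an unseen term is a choice made here rather than documented package behavior.

```go
package main

import (
	"fmt"
	"math"
)

// documentCount (hypothetical) returns how many tokenized documents contain term.
func documentCount(term string, documents [][]string) int {
	n := 0
	for _, doc := range documents {
		for _, tok := range doc {
			if tok == term {
				n++
				break // count each document at most once
			}
		}
	}
	return n
}

// logIDF (hypothetical) applies plain log weighting, log(N / n_t).
// Unseen terms yield 0 here to avoid dividing by zero.
func logIDF(term string, documents [][]string) float64 {
	n := documentCount(term, documents)
	if n == 0 {
		return 0
	}
	return math.Log(float64(len(documents)) / float64(n))
}

func main() {
	corpus := [][]string{
		{"cat", "sat", "cat"},
		{"cat", "ran"},
		{"dog", "ran"},
	}
	fmt.Println(documentCount("cat", corpus)) // 2
	fmt.Println(logIDF("cat", corpus))        // log(3/2)
	fmt.Println(logIDF("bird", corpus))       // 0
}
```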
func TermFrequencies ¶
func TermFrequencies(compareDoc []string, documents [][]string) []float64
TermFrequencies takes a document, compareDoc, and returns the frequency of the tokens in it. The number and order of the tokens in the result are determined by the given corpus, documents. Note that compareDoc is usually part of that corpus, and that both arguments must contain already tokenized elements.
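To make the "number and order" point concrete, here is a small sketch that derives a vocabulary from the corpus and returns one raw count per vocabulary entry. Ordering the vocabulary by first appearance is an assumption made for this example. The page does not state which order the package uses, and the lowercase function names are hypothetical.

```go
package main

import "fmt"

// vocabulary (hypothetical) lists each distinct token of the corpus
// in order of first appearance.
func vocabulary(documents [][]string) []string {
	seen := make(map[string]bool)
	var vocab []string
	for _, doc := range documents {
		for _, tok := range doc {
			if !seen[tok] {
				seen[tok] = true
				vocab = append(vocab, tok)
			}
		}
	}
	return vocab
}

// termFrequencies (hypothetical) returns one raw count per vocabulary entry,
// so vectors for different documents of the same corpus line up.
func termFrequencies(compareDoc []string, documents [][]string) []float64 {
	counts := make(map[string]float64)
	for _, tok := range compareDoc {
		counts[tok]++
	}

	vocab := vocabulary(documents)
	freqs := make([]float64, len(vocab))
	for i, term := range vocab {
		freqs[i] = counts[term] // 0 for terms absent from compareDoc
	}
	return freqs
}

func main() {
	corpus := [][]string{{"cat", "sat", "cat"}, {"dog", "sat"}}
	// Vocabulary is [cat sat dog].
	fmt.Println(termFrequencies(corpus[0], corpus)) // [2 1 0]
}
```

Because every vector shares the corpus vocabulary, two documents can be compared position by position, for example with cosine similarity.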
func TermFrequency ¶
func TermFrequency(term string, stem bool, document []string, weighting weightingScheme) float64
TermFrequency calculates the number of occurrences of a given term in a given document. It expects a term, stems it if stem is set, and looks up its frequency in an already tokenized document. The form of the returned value depends on the specified weighting scheme.
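A minimal sketch of the count-then-weight idea, covering only the binary, raw, and log normalization schemes in their textbook forms. The scheme type and termFrequency function are hypothetical stand-ins for the package's unexported weightingScheme and its real implementation, and stemming is omitted.

```go
package main

import (
	"fmt"
	"math"
)

// scheme is a hypothetical stand-in for the package's weightingScheme type.
type scheme int

const (
	binary  scheme = iota // 1 if the term occurs, 0 otherwise
	raw                   // plain occurrence count
	logNorm               // log(1 + count)
)

// termFrequency (hypothetical) counts term in an already tokenized document
// and then applies the chosen weighting to the raw count.
func termFrequency(term string, document []string, w scheme) float64 {
	count := 0.0
	for _, tok := range document {
		if tok == term {
			count++
		}
	}

	switch w {
	case binary:
		if count > 0 {
			return 1
		}
		return 0
	case logNorm:
		return math.Log(1 + count)
	default: // raw
		return count
	}
}

func main() {
	doc := []string{"cat", "sat", "cat", "mat"}
	fmt.Println(termFrequency("cat", doc, raw))     // 2
	fmt.Println(termFrequency("cat", doc, binary))  // 1
	fmt.Println(termFrequency("dog", doc, logNorm)) // 0
}
```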
func TokenizeDocument ¶
func TokenizeDocument(document string) []string
TokenizeDocument takes an input document as a string and tokenizes it. Along the way, stop bytes are removed from the document, and each remaining term enters the output list only in its stemmed form.
This function was heavily inspired by Allison Morgan's 'AddDocument' function from her 'tfidf' package: https://github.com/allisonmorgan/tfidf/blob/master/tfidf.go#L36
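The pipeline described above (split, filter, stem) might look roughly like the following sketch. Everything in it is illustrative: the stop list is a tiny made-up sample treated as whole words, which is an assumption about what "stop bytes" means, and stem is a trivial placeholder rather than a real stemming algorithm such as Porter's.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative stop list, not the one the package ships.
var stopWords = map[string]bool{"a": true, "the": true, "on": true, "is": true}

// stem is a hypothetical placeholder that only strips a trailing "s";
// a real stemmer would go here.
func stem(token string) string {
	return strings.TrimSuffix(token, "s")
}

// tokenize (hypothetical) lowercases the document, splits it on anything that
// is not a letter or digit, drops stop words, and stems what remains.
func tokenize(document string) []string {
	words := strings.FieldsFunc(strings.ToLower(document), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})

	tokens := make([]string, 0, len(words))
	for _, w := range words {
		if stopWords[w] {
			continue
		}
		tokens = append(tokens, stem(w))
	}
	return tokens
}

func main() {
	fmt.Println(tokenize("The cats sat on the mats.")) // [cat sat mat]
}
```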