tfidf

Version: v0.0.0-...-e91d5a5 | Published: Nov 8, 2016 | License: GPL-3.0 | Imports: 5 | Imported by: 2

Repository

github.com/numbleroot/go-tfidf

README ¶

tfidf in Go

An implementation of useful term frequency - inverse document frequency (tf-idf) functions in Go.

WARNING: Very recent (and quick) implementation, not guaranteed to be bug-free.

Documentation ¶

Overview ¶

This file was taken directly from https://github.com/lytics/multibayes/blob/master/stopbytes.go. Please refer to lytics' multibayes package for further information and credit.

Index ¶

Constants
func InverseDocumentFrequencies(documents [][]string, weighting weightingScheme) map[string]float64
func InverseDocumentFrequency(term string, stem bool, documents [][]string, weighting weightingScheme) float64
func TermFrequencies(compareDoc []string, documents [][]string) []float64
func TermFrequency(term string, stem bool, document []string, weighting weightingScheme) float64
func TokenizeDocument(document string) []string

Constants ¶

const (

	// Term frequency weightings:
	// * Binary weighting.
	TermWeightingBinary weightingScheme = 0
	// * Raw frequency weighting.
	TermWeightingRaw weightingScheme = 1
	// * Log normalization weighting.
	TermWeightingLog weightingScheme = 2
	// * Double normalization 0.5 weighting.
	TermWeightingDoubleHalf weightingScheme = 3
	// * Double normalization K weighting.
	TermWeightingDoubleK weightingScheme = 4

	// Inverse document frequency weightings:
	// * Unary weighting.
	InvDocWeightingUnary weightingScheme = 0
	// * Log weighting.
	InvDocWeightingLog weightingScheme = 1
	// * Log smooth weighting.
	InvDocWeightingLogSmooth weightingScheme = 2
	// * Log maximum weighting.
	InvDocWeightingLogMax weightingScheme = 3
	// * Probabilistic weighting.
	InvDocWeightingProb weightingScheme = 4
)

Variables ¶

This section is empty.

Functions ¶

func InverseDocumentFrequencies ¶

func InverseDocumentFrequencies(documents [][]string, weighting weightingScheme) map[string]float64

Wrapper function that returns the inverse document frequency vector, as a map[string]float64, for all terms in the supplied corpus, i.e. all tokenized documents.
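As a rough illustration of what such a per-term map contains, the sketch below computes the plain logarithmic variant, log(N / n_t), for every distinct term of a small corpus. The helper name idfForCorpus is hypothetical and the code is not taken from the package, so the package's own results may differ depending on the weighting scheme passed in.

```go
package main

import (
	"fmt"
	"math"
)

// idfForCorpus is a hypothetical helper, not the package's implementation.
// It returns log(N / n_t) for every distinct term t, where N is the number of
// documents and n_t is the number of documents that contain t.
func idfForCorpus(documents [][]string) map[string]float64 {
	docCount := make(map[string]int)
	for _, doc := range documents {
		// Count each term at most once per document.
		seen := make(map[string]bool)
		for _, term := range doc {
			if !seen[term] {
				seen[term] = true
				docCount[term]++
			}
		}
	}

	idf := make(map[string]float64, len(docCount))
	n := float64(len(documents))
	for term, count := range docCount {
		idf[term] = math.Log(n / float64(count))
	}
	return idf
}

func main() {
	corpus := [][]string{
		{"go", "is", "fun"},
		{"go", "is", "fast"},
		{"tea", "is", "hot"},
	}
	idf := idfForCorpus(corpus)
	// Prints: is=0.000 go=0.405 tea=1.099
	fmt.Printf("is=%.3f go=%.3f tea=%.3f\n", idf["is"], idf["go"], idf["tea"])
}
```

A term that occurs in every document ("is") scores 0, while a term confined to a single document ("tea") scores highest.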

func InverseDocumentFrequency ¶

func InverseDocumentFrequency(term string, stem bool, documents [][]string, weighting weightingScheme) float64

Takes a term, optionally stems it, and counts its appearances in the supplied set of already tokenized documents. The resulting value is then adjusted according to the supplied weighting scheme.
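The five inverse document frequency weightings named in the constants correspond to variants that are commonly listed in the tf-idf literature. The sketch below writes those textbook forms out in Go. The type idfScheme, its values, and the function weightIDF are local stand-ins invented for this example; whether the package uses exactly these formulas is an assumption, not something the documentation states.

```go
package main

import (
	"fmt"
	"math"
)

// idfScheme and its values are local stand-ins for this sketch; they are not
// the package's weightingScheme constants.
type idfScheme int

const (
	idfUnary idfScheme = iota
	idfLog
	idfLogSmooth
	idfLogMax
	idfProb
)

// weightIDF applies a commonly cited IDF variant. n is the corpus size, nt the
// number of documents containing the term, and maxNt the largest document
// count of any term. The caller must ensure nt > 0 for idfLog and idfProb.
func weightIDF(n, nt, maxNt float64, scheme idfScheme) float64 {
	switch scheme {
	case idfUnary:
		return 1
	case idfLog:
		return math.Log(n / nt)
	case idfLogSmooth:
		return math.Log(n/(1+nt)) + 1
	case idfLogMax:
		return math.Log(maxNt / (1 + nt))
	case idfProb:
		return math.Log((n - nt) / nt)
	default:
		return 0
	}
}

func main() {
	// A term found in 2 of 10 documents; the most widespread term is in 8.
	for _, s := range []idfScheme{idfUnary, idfLog, idfLogSmooth, idfLogMax, idfProb} {
		fmt.Printf("scheme %d: %.3f\n", s, weightIDF(10, 2, 8, s))
	}
}
```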

func TermFrequencies ¶

func TermFrequencies(compareDoc []string, documents [][]string) []float64

Takes a document, compareDoc, and returns the frequencies of the tokens in it. The number and order of the tokens are derived from the given corpus, documents. Note that compareDoc is usually part of the corpus, and that both arguments contain already tokenized elements.
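One way to picture the returned slice is as a vector over the corpus vocabulary: one entry per distinct term, holding that term's count in compareDoc. The sketch below assumes raw counts and a vocabulary ordered by first appearance; both are assumptions of this example, and the helper names vocabulary and frequencyVector are hypothetical.

```go
package main

import "fmt"

// vocabulary returns the distinct terms of the corpus in order of first
// appearance. The ordering rule is an assumption of this sketch.
func vocabulary(documents [][]string) []string {
	seen := make(map[string]bool)
	var vocab []string
	for _, doc := range documents {
		for _, term := range doc {
			if !seen[term] {
				seen[term] = true
				vocab = append(vocab, term)
			}
		}
	}
	return vocab
}

// frequencyVector counts how often each vocabulary term occurs in compareDoc
// and returns the counts in vocabulary order.
func frequencyVector(compareDoc []string, documents [][]string) []float64 {
	counts := make(map[string]float64)
	for _, term := range compareDoc {
		counts[term]++
	}
	vocab := vocabulary(documents)
	vec := make([]float64, len(vocab))
	for i, term := range vocab {
		vec[i] = counts[term]
	}
	return vec
}

func main() {
	corpus := [][]string{
		{"go", "go", "fast"},
		{"tea", "hot"},
	}
	// Prints: [go fast tea hot]
	fmt.Println(vocabulary(corpus))
	// Prints: [2 1 0 0]
	fmt.Println(frequencyVector(corpus[0], corpus))
}
```

Because every document is projected onto the same vocabulary, the resulting vectors have equal length and can be compared position by position.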

func TermFrequency ¶

func TermFrequency(term string, stem bool, document []string, weighting weightingScheme) float64

This function calculates the number of occurrences of a given term in a given document. It expects a term, optionally stems it, and looks up its frequency in an already tokenized document. The specified weighting scheme determines the form of the returned value.
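The term frequency weightings named in the constants match variants that are commonly listed in the tf-idf literature: binary, raw count, log normalization, and double normalization with a constant K (0.5 being the usual special case). The sketch below spells out those textbook forms. The names tfScheme and termFrequency are invented for this example, stemming is left out, and it is an assumption that the package computes the same expressions.

```go
package main

import (
	"fmt"
	"math"
)

// tfScheme and its values are local stand-ins for this sketch; they are not
// the package's weightingScheme constants.
type tfScheme int

const (
	tfBinary tfScheme = iota
	tfRaw
	tfLog
	tfDoubleK
)

// termFrequency counts term in document and applies the chosen weighting.
// k is only used by tfDoubleK; k = 0.5 gives "double normalization 0.5".
func termFrequency(term string, document []string, scheme tfScheme, k float64) float64 {
	counts := make(map[string]float64)
	maxCount := 0.0
	for _, t := range document {
		counts[t]++
		if counts[t] > maxCount {
			maxCount = counts[t]
		}
	}

	f := counts[term]
	switch scheme {
	case tfBinary:
		if f > 0 {
			return 1
		}
		return 0
	case tfRaw:
		return f
	case tfLog:
		return math.Log(1 + f)
	case tfDoubleK:
		if maxCount == 0 {
			// Empty document: treat the ratio f / maxCount as 0.
			return k
		}
		return k + (1-k)*f/maxCount
	default:
		return 0
	}
}

func main() {
	doc := []string{"go", "go", "fast"}
	fmt.Println(termFrequency("go", doc, tfRaw, 0))        // raw count
	fmt.Println(termFrequency("go", doc, tfLog, 0))        // log(1 + count)
	fmt.Println(termFrequency("fast", doc, tfDoubleK, 0.5)) // 0.5 + 0.5 * 1/2
}
```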

func TokenizeDocument ¶

func TokenizeDocument(document string) []string

Takes an input document as a string and tokenizes it. Along the way, stop bytes are removed from the document, and each remaining term is added to the output list only in its stemmed form.

This function was heavily inspired by Allison Morgan's 'AddDocument' function from her 'tfidf' package: https://github.com/allisonmorgan/tfidf/blob/master/tfidf.go#L36
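A simplified version of such a pipeline might look like the sketch below. It lowercases the text, splits on non-alphanumeric characters, and filters a tiny stop-word list that stands in for the package's stop-byte removal. Stemming is deliberately omitted because the Go standard library has no stemmer. The function name tokenize and the stopWords list are hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list, not the list used by the package.
var stopWords = map[string]bool{
	"a": true, "an": true, "the": true, "is": true, "of": true,
}

// tokenize lowercases the text, splits it on anything that is not a letter or
// digit, and drops stop words. Unlike the documented function, it does not
// stem the remaining terms.
func tokenize(document string) []string {
	fields := strings.FieldsFunc(strings.ToLower(document), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})

	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		if !stopWords[f] {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	// Prints: [speed go feature]
	fmt.Println(tokenize("The speed of Go is a feature."))
}
```

The resulting []string has the shape that the frequency functions above expect for an already tokenized document.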

Types ¶

This section is empty.
