tfidf

v0.0.0-...-e91d5a5
Published: Nov 8, 2016 License: GPL-3.0 Imports: 5 Imported by: 2

README

tfidf in Go

An implementation of useful term frequency-inverse document frequency (tf-idf) functions in Go.

WARNING: Very recent (and quick) implementation, not guaranteed to be bug-free.

Documentation

Overview

This file was taken directly from https://github.com/lytics/multibayes/blob/master/stopbytes.go. Please refer to lytics' multibayes package for further information and credit.

Index

Constants

const (

	// Term frequency weightings:
	// * Binary weighting.
	TermWeightingBinary weightingScheme = 0
	// * Raw frequency weighting.
	TermWeightingRaw weightingScheme = 1
	// * Log normalization weighting.
	TermWeightingLog weightingScheme = 2
	// * Double normalization 0.5 weighting.
	TermWeightingDoubleHalf weightingScheme = 3
	// * Double normalization K weighting.
	TermWeightingDoubleK weightingScheme = 4

	// Inverse document frequency weightings:
	// * Unary weighting.
	InvDocWeightingUnary weightingScheme = 0
	// * Log weighting.
	InvDocWeightingLog weightingScheme = 1
	// * Log smooth weighting.
	InvDocWeightingLogSmooth weightingScheme = 2
	// * Log maximum weighting.
	InvDocWeightingLogMax weightingScheme = 3
	// * Probabilistic weighting.
	InvDocWeightingProb weightingScheme = 4
)

Variables

This section is empty.

Functions

func InverseDocumentFrequencies

func InverseDocumentFrequencies(documents [][]string, weighting weightingScheme) map[string]float64

Wrapper function that returns the inverse document frequency vector, as a map[string]float64, for all terms in the supplied corpus, i.e. all tokenized documents.
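To make the shape of such a map concrete, here is a minimal, self-contained sketch. It does not call this package: documentFrequencies and logIDF are hypothetical helpers, and the plain log(N / n_t) weighting is an assumption chosen for illustration, so the package's own results may differ.

```go
package main

import (
	"fmt"
	"math"
)

// documentFrequencies counts, for every term, the number of documents
// that contain it at least once. Hypothetical helper, not part of the package.
func documentFrequencies(documents [][]string) map[string]int {
	df := make(map[string]int)
	for _, doc := range documents {
		seen := make(map[string]bool)
		for _, term := range doc {
			if !seen[term] {
				seen[term] = true
				df[term]++
			}
		}
	}
	return df
}

// logIDF maps every term in the corpus to log(N / n_t), where N is the
// number of documents and n_t the number of documents containing the term.
// This weighting is an assumption made for the sketch.
func logIDF(documents [][]string) map[string]float64 {
	n := float64(len(documents))
	idf := make(map[string]float64)
	for term, count := range documentFrequencies(documents) {
		idf[term] = math.Log(n / float64(count))
	}
	return idf
}

func main() {
	corpus := [][]string{
		{"go", "is", "fun"},
		{"go", "is", "fast"},
	}
	for term, value := range logIDF(corpus) {
		fmt.Printf("%s: %.4f\n", term, value)
	}
}
```

Terms that occur in every document get a weight of zero under this scheme, which is why a smoothed or unary weighting is sometimes preferred.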

func InverseDocumentFrequency

func InverseDocumentFrequency(term string, stem bool, documents [][]string, weighting weightingScheme) float64

Takes a term, optionally stems it (controlled by stem), and counts its appearances in the supplied set of already tokenized documents. The resulting value is then adjusted according to the supplied weighting scheme.
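The five inverse document frequency weightings named in the constants correspond to well-known textbook formulas. The sketch below shows one common formulation of each. It is an assumption, not a copy of this package's code: idfScheme, its constants, and weightIDF are hypothetical names, and the package may use different variants (the smoothed form in particular exists in several versions).

```go
package main

import (
	"fmt"
	"math"
)

// idfScheme and its constants are hypothetical stand-ins for the
// package's unexported weightingScheme type.
type idfScheme int

const (
	idfUnary idfScheme = iota
	idfLog
	idfLogSmooth
	idfLogMax
	idfProb
)

// weightIDF applies a textbook idf weighting.
//   n     = number of documents in the corpus
//   nt    = number of documents containing the term
//   maxNt = largest document count of any term in the corpus
// A term that appears in no document gets weight 0 to avoid division by zero.
// Note that idfProb yields -Inf when the term appears in every document.
func weightIDF(scheme idfScheme, n, nt, maxNt float64) float64 {
	if nt == 0 {
		return 0
	}
	switch scheme {
	case idfUnary:
		return 1
	case idfLog:
		return math.Log(n / nt)
	case idfLogSmooth:
		return math.Log(1 + n/nt)
	case idfLogMax:
		return math.Log(maxNt / (1 + nt))
	case idfProb:
		return math.Log((n - nt) / nt)
	}
	return 0
}

func main() {
	// A term found in 2 of 10 documents; the most widespread term is in 5.
	for scheme := idfUnary; scheme <= idfProb; scheme++ {
		fmt.Printf("scheme %d: %.4f\n", scheme, weightIDF(scheme, 10, 2, 5))
	}
}
```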

func TermFrequencies

func TermFrequencies(compareDoc []string, documents [][]string) []float64

This function takes a compareDoc and returns the frequency of the tokens in it. The number and order of the tokens are determined by the given documents corpus. Note that compareDoc is usually part of the corpus, and that both arguments must contain already tokenized elements.
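The idea of a frequency vector whose length and order come from the corpus can be sketched as follows. This is an illustration only: vocabulary and frequencyVector are hypothetical helpers, and ordering terms by first appearance in the corpus is an assumption, since the description above does not state how the package orders them.

```go
package main

import "fmt"

// vocabulary lists every distinct term of the corpus in order of first
// appearance. The ordering rule is an assumption made for this sketch.
func vocabulary(documents [][]string) []string {
	seen := make(map[string]bool)
	var vocab []string
	for _, doc := range documents {
		for _, term := range doc {
			if !seen[term] {
				seen[term] = true
				vocab = append(vocab, term)
			}
		}
	}
	return vocab
}

// frequencyVector returns one raw count per vocabulary term, counting
// how often that term occurs in compareDoc.
func frequencyVector(compareDoc []string, documents [][]string) []float64 {
	counts := make(map[string]float64)
	for _, term := range compareDoc {
		counts[term]++
	}
	vocab := vocabulary(documents)
	vec := make([]float64, len(vocab))
	for i, term := range vocab {
		vec[i] = counts[term]
	}
	return vec
}

func main() {
	corpus := [][]string{
		{"go", "is", "fun"},
		{"go", "is", "fast", "fast"},
	}
	fmt.Println(vocabulary(corpus))
	fmt.Println(frequencyVector(corpus[1], corpus))
}
```

Because every document is projected onto the same vocabulary, vectors built this way have equal length and can be compared position by position.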

func TermFrequency

func TermFrequency(term string, stem bool, document []string, weighting weightingScheme) float64

This function calculates the number of occurrences of a given term in a given document. Depending on the specified weighting scheme, the resulting value takes a specific form. The function expects a term, optionally stems it (controlled by stem), and looks up its frequency in an already tokenized document.
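As a rough guide to what the term frequency weightings in the constants usually mean, the following sketch implements their textbook definitions. It is not taken from this package: tfScheme, rawCount, highestCount, and weightTF are hypothetical names, the formulas are the commonly cited ones, and the smoothing parameter k for the double normalization K scheme is passed explicitly here because the documented signature does not show where the package gets it.

```go
package main

import (
	"fmt"
	"math"
)

// tfScheme and its constants are hypothetical stand-ins for the
// package's unexported weightingScheme type.
type tfScheme int

const (
	tfBinary tfScheme = iota
	tfRaw
	tfLog
	tfDoubleHalf
	tfDoubleK
)

// rawCount returns how often term occurs in the tokenized document.
func rawCount(term string, document []string) float64 {
	count := 0.0
	for _, token := range document {
		if token == term {
			count++
		}
	}
	return count
}

// highestCount returns the count of the most frequent term in the document.
func highestCount(document []string) float64 {
	counts := make(map[string]float64)
	highest := 0.0
	for _, token := range document {
		counts[token]++
		if counts[token] > highest {
			highest = counts[token]
		}
	}
	return highest
}

// weightTF applies a textbook term frequency weighting to a raw count.
// maxCount is the count of the document's most frequent term; k is only
// used by tfDoubleK.
func weightTF(scheme tfScheme, count, maxCount, k float64) float64 {
	ratio := 0.0
	if maxCount > 0 {
		ratio = count / maxCount
	}
	switch scheme {
	case tfBinary:
		if count > 0 {
			return 1
		}
		return 0
	case tfRaw:
		return count
	case tfLog:
		return math.Log(1 + count)
	case tfDoubleHalf:
		return 0.5 + 0.5*ratio
	case tfDoubleK:
		return k + (1-k)*ratio
	}
	return 0
}

func main() {
	doc := []string{"go", "is", "fun", "go"}
	count := rawCount("go", doc)
	peak := highestCount(doc)
	fmt.Println(weightTF(tfRaw, count, peak, 0))
	fmt.Println(weightTF(tfDoubleK, count, peak, 0.25))
}
```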

func TokenizeDocument

func TokenizeDocument(document string) []string

Takes an input document as a string and tokenizes it. Along the way, stop bytes are removed from the document, and each remaining term is added to the output list only in its stemmed form.

This function was heavily inspired by Allison Morgan's 'AddDocument' function from her 'tfidf' package: https://github.com/allisonmorgan/tfidf/blob/master/tfidf.go#L36
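A simplified view of such a tokenization step is sketched below using only the standard library. Several things here are assumptions: tokenize is a hypothetical function, every rune that is neither a letter nor a digit is treated as a stop byte and as a token separator, tokens are lowercased, and stemming is left out entirely because the standard library has no stemmer. The package's actual stop byte list and stemmer are not reproduced.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the document and splits it on every rune that is
// neither a letter nor a digit, so punctuation and whitespace never reach
// the output. Stemming is intentionally omitted in this sketch.
func tokenize(document string) []string {
	isSeparator := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	}
	return strings.FieldsFunc(strings.ToLower(document), isSeparator)
}

func main() {
	tokens := tokenize("Go is fun, and Go is FAST!")
	fmt.Println(len(tokens), tokens)
}
```

Splitting on punctuation means a word such as "don't" becomes two tokens under this sketch; deleting the stop bytes instead of splitting on them would keep it as one.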

Types

This section is empty.
