cirrus

package module
v0.0.0-...-9b83621
Published: Nov 17, 2022 | License: Apache-2.0 | Imports: 10 | Imported by: 0

README

cirrus

quick and dirty entity recognition

Cirrus attempts to serialize arbitrary natural text into a Go value using dictionary matching. At the moment it doesn't really do much, and shouldn't be used by anybody.

Cirrus tries to parse tokens into various values based on pretty simple heuristics, like capitalization and the presence of a dollar sign. Some entities, like cardinals ("one", "many", "ten", etc.), have to be matched against every token.
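
The heuristics described above might look roughly like the following. This is a minimal sketch, not the package's actual code: the `guess` type and `classifyToken` function are hypothetical names made up for illustration, and the two rules (leading dollar sign, leading capital) are only the ones the README mentions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// guess is a hypothetical label type; it is not part of cirrus.
type guess string

const (
	guessMoney      guess = "money"
	guessProperNoun guess = "proper-noun"
	guessUnknown    guess = "unknown"
)

// classifyToken applies two cheap checks to a single token:
// a dollar sign followed by a digit suggests money, and a
// leading uppercase letter suggests a proper noun.
func classifyToken(tok string) guess {
	if strings.HasPrefix(tok, "$") && len(tok) > 1 && tok[1] >= '0' && tok[1] <= '9' {
		return guessMoney
	}
	r, _ := utf8.DecodeRuneInString(tok)
	if unicode.IsUpper(r) {
		return guessProperNoun
	}
	return guessUnknown
}

func main() {
	for _, tok := range strings.Fields("Charles paid $20 for lunch") {
		fmt.Printf("%-8s %s\n", tok, classifyToken(tok))
	}
}
```

Capitalization alone misfires on the first word of every sentence, which is one reason a dictionary lookup is still needed for some entity types.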

Roadmap

  • grouping of sequential entities
    • e.g. sequential cardinals can be grouped: "two dozen" becomes Result{Value: 24}
  • text classification
    • use wordnet?

Examples

  • 2021-10-22 is a date
  • $20 is a value in USD
  • 20mph is a value in miles per hour
  • Hong Kong is a city
  • Charles Dickens is a person
  • Microsoft or MSFT is a company
  • Australia is a country
Numbers
1.2
10e-3
1.00001
20E10
15
2/3
one
two thousand three hundred seventy five
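
Most of the numeric forms listed above are already accepted by the standard library's strconv.ParseFloat; only the fraction needs special handling. The following is a sketch, with `parseNumber` as a hypothetical helper name. Spelled-out numbers such as "one" are not handled here.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseNumber handles decimal and exponent notation via strconv,
// plus simple "a/b" fractions. It is an illustration only.
func parseNumber(s string) (float64, error) {
	if num, den, ok := strings.Cut(s, "/"); ok {
		n, err := strconv.ParseFloat(num, 64)
		if err != nil {
			return 0, err
		}
		d, err := strconv.ParseFloat(den, 64)
		if err != nil {
			return 0, err
		}
		if d == 0 {
			return 0, errors.New("zero denominator")
		}
		return n / d, nil
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	for _, s := range []string{"1.2", "10e-3", "1.00001", "20E10", "15", "2/3", "one"} {
		v, err := parseNumber(s)
		if err != nil {
			fmt.Println(s, "=> not a numeric literal")
			continue
		}
		fmt.Println(s, "=>", v)
	}
}
```

Note that strconv.ParseFloat also accepts strings such as "inf" and "NaN", so a recognizer would probably want to pre-filter tokens that contain no digit.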
Dates
Christmas eve
10/22/15
10.22.15
2005
december 5 2005 // variations thereof
Times
12:10
ten minutes and five seconds
Phone Numbers

Addresses
Email Addresses
  • probably best to use an established regexp for this
Web Addresses
  • best to use an established regexp or std's url.Parse
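
As an alternative to a regexp, both checks can be sketched with the standard library: net/mail.ParseAddress for email and net/url.Parse for web addresses. The helper names `isEmail` and `isWebAddress` are hypothetical. Because url.Parse accepts almost any string, the sketch additionally requires an http or https scheme and a non-empty host, which is an assumption about what should count as a web address.

```go
package main

import (
	"fmt"
	"net/mail"
	"net/url"
)

// isEmail reports whether s is a bare address such as "jane@example.com".
// Display-name forms like "Jane <jane@example.com>" are rejected on purpose.
func isEmail(s string) bool {
	addr, err := mail.ParseAddress(s)
	return err == nil && addr.Address == s
}

// isWebAddress reports whether s parses as an absolute http(s) URL with a host.
func isWebAddress(s string) bool {
	u, err := url.Parse(s)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	for _, s := range []string{"jane@example.com", "https://example.com/docs", "example", "john"} {
		fmt.Printf("%-26s email=%v web=%v\n", s, isEmail(s), isWebAddress(s))
	}
}
```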
Tables
| name | age |
| john | 2 |
| jane | 12 |

name,age
john,2
jane,12

name age
"john" 2
"jane" 12

name    age
john    2
jane    12
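
The four layouts above (pipe-delimited, comma-separated, quoted, and whitespace-aligned) can all be reduced to rows of cells with a small splitter. This is a naive sketch with a hypothetical `splitRow` helper: it does not handle quoted cells that contain spaces or commas, for which encoding/csv would be a better fit.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRow picks a delimiter by inspecting the line, then trims
// whitespace and surrounding double quotes from every cell.
func splitRow(line string) []string {
	line = strings.TrimSpace(line)
	var cells []string
	switch {
	case strings.Contains(line, "|"):
		cells = strings.Split(strings.Trim(line, "|"), "|")
	case strings.Contains(line, ","):
		cells = strings.Split(line, ",")
	default:
		cells = strings.Fields(line)
	}
	for i, c := range cells {
		cells[i] = strings.Trim(strings.TrimSpace(c), `"`)
	}
	return cells
}

func main() {
	for _, line := range []string{"| john | 2 |", "john,2", `"john" 2`, "john    2"} {
		fmt.Printf("%q\n", splitRow(line))
	}
}
```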

Documentation

Overview

Classifier attempts to classify documents by topic.

Index

Constants

const (
	NO_UNIT = iota
	// metric
	METERS

	INCHES
	FEET
	MILES

	TIME
)
const (
	NONE = iota
	LINK
	QUANTITY
	DATE
	ORG
	CARDINAL
	MONEY
	EVENT
)

Variables

var (
	//go:embed data/cardinality.txt
	CARDINALITY_DICT_EMBED string
	CARDINALITY_DICT       = LoadDictionary(CARDINALITY_DICT_EMBED)
)
var (
	ErrNoExtract = errors.New("couldn't determine meaning")
)
var NUM_REGEXP = regexp.MustCompile(`\d+`)

Matches any sequence of one or more digits.

var SINGLE_NUMBER_REGEXP = regexp.MustCompile(`\d`)

Matches a single digit.

var UNIT_EXTRACT_REGEXP = regexp.MustCompile(`\d+\s?[A-Z..a-z]+`)

Matches one or more digits followed by optional whitespace and a sequence of one or more letters; used for determining whether a string may be a quantity. As written, the character class [A-Z..a-z] also accepts a literal period.

var UNIT_TYPE_REGEXP = regexp.MustCompile(`[a-zA-Z]+`)

Matches one or more ASCII letters; used as a filter for determining the unit.
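
One plausible way these patterns fit together is sketched below. The three patterns are copied from the declarations above, but `splitQuantity` and the way it combines them are assumptions, not the package's documented behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// Local copies of the patterns documented above.
var (
	numRegexp         = regexp.MustCompile(`\d+`)
	unitExtractRegexp = regexp.MustCompile(`\d+\s?[A-Z..a-z]+`)
	unitTypeRegexp    = regexp.MustCompile(`[a-zA-Z]+`)
)

// splitQuantity (hypothetical) first checks that the token looks like
// a quantity, then pulls out the digits and the unit letters separately.
func splitQuantity(tok string) (amount, unit string, ok bool) {
	m := unitExtractRegexp.FindString(tok)
	if m == "" {
		return "", "", false
	}
	return numRegexp.FindString(m), unitTypeRegexp.FindString(m), true
}

func main() {
	for _, tok := range []string{"20mph", "20 ft", "$20", "fast"} {
		amount, unit, ok := splitQuantity(tok)
		fmt.Printf("%-6q amount=%q unit=%q ok=%v\n", tok, amount, unit, ok)
	}
}
```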

var Units = [][]string{
	NO_UNIT: {"none"},

	METERS: {"meters", "m"},

	INCHES: {"inches", "in"},
	FEET:   {"feet", "ft", "foot"},
	MILES:  {"mile", "mi"},

	TIME: {"minute", "second", "hour", "day"},
}

Recognized units.
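
Because the table is indexed by the unit constants, resolving a unit name is a nested scan over the aliases. The sketch below repeats the declarations so that it compiles on its own; `lookupUnit` is a hypothetical helper, and falling back to NO_UNIT for unknown names is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type Unit int

const (
	NO_UNIT = iota
	METERS
	INCHES
	FEET
	MILES
	TIME
)

// Units mirrors the table documented above.
var Units = [][]string{
	NO_UNIT: {"none"},
	METERS:  {"meters", "m"},
	INCHES:  {"inches", "in"},
	FEET:    {"feet", "ft", "foot"},
	MILES:   {"mile", "mi"},
	TIME:    {"minute", "second", "hour", "day"},
}

// lookupUnit returns the Unit whose alias list contains name,
// compared case-insensitively, or NO_UNIT if nothing matches.
func lookupUnit(name string) Unit {
	name = strings.ToLower(name)
	for u, aliases := range Units {
		for _, alias := range aliases {
			if alias == name {
				return Unit(u)
			}
		}
	}
	return NO_UNIT
}

func main() {
	for _, name := range []string{"ft", "Foot", "mi", "hour", "mph"} {
		fmt.Println(name, "=>", int(lookupUnit(name)))
	}
}
```

With exact matching, plural or compound forms that are not listed (for example "miles" or "mph") fall through to NO_UNIT.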

Functions

func LoadDictionary

func LoadDictionary(data string) *txt.Node

Loads a dictionary from an embedded newline-delimited dictionary file. The dictionary words are placed in a trie.
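
To illustrate the idea, here is a minimal trie built from newline-delimited text. It is a sketch only: the `node` type and its methods are hypothetical stand-ins, and the real function returns a *txt.Node whose API is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical trie node keyed by rune.
type node struct {
	children map[rune]*node
	terminal bool // true if a dictionary word ends here
}

func newNode() *node {
	return &node{children: map[rune]*node{}}
}

// insert adds word to the trie, creating nodes as needed.
func (n *node) insert(word string) {
	cur := n
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.terminal = true
}

// contains reports whether word was inserted as a complete entry.
func (n *node) contains(word string) bool {
	cur := n
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.terminal
}

// loadDictionary builds a trie from newline-delimited words,
// skipping blank lines.
func loadDictionary(data string) *node {
	root := newNode()
	for _, line := range strings.Split(data, "\n") {
		if w := strings.TrimSpace(line); w != "" {
			root.insert(w)
		}
	}
	return root
}

func main() {
	dict := loadDictionary("one\ntwo\nten\nmany\n")
	for _, w := range []string{"ten", "te", "many", "dozen"} {
		fmt.Println(w, "=>", dict.contains(w))
	}
}
```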

Types

type Result

type Result struct {
	// type of result
	ResultType ResultType `json:"type"`
	// title/label of the unit, as in with graphs
	// e.g. "Number of Queries Per Second"
	Label string `json:"label"`
	// Custom data
	Data dataType `json:"data"`
	// value
	Value      string `json:"value"`
	Start, End uint
}

func Recognize

func Recognize(text string) ([]*Result, error)

Recognizes entities within a piece of natural text.

type ResultType

type ResultType int

func (ResultType) String

func (r ResultType) String() string

type Topic

type Topic struct {
	Name   string
	Weight float64
}

func Classify

func Classify(text string) []Topic

type Unit

type Unit int

func (Unit) String

func (u Unit) String() string

type UnitValue

type UnitValue struct {
	Value Unit
}
