intentnlu

v0.3.2 Latest
Published: Jun 13, 2026 License: MIT Imports: 25 Imported by: 0

README

intent-nlu

Lightweight, embeddable intent classification engine for Go.

  • Module: github.com/godeps/intent-nlu
  • Chinese docs: README.zh-CN.md
  • Core stack:
    • Tokenization: github.com/go-ego/gse (Chinese), normalized splitter for non-CJK
    • Classifier: github.com/jbrukh/bayesian

What It Provides

  1. Low-latency pre-LLM intent recognition (~1ms/request).
  2. Skill routing for creative pipelines (video, image, audio, 3D, analysis).
  3. Tool routing for operational intents (search, code, tasks, files, data, docs, etc.).
  4. Deterministic train/val/test evaluation pipeline.
  5. Per-intent threshold calibration (optional).
  6. Intent taxonomy normalization (aliases -> canonical intents).
  7. Multi-language routing (zh, en, extensible).
  8. Hybrid policy (rules -> NLU -> fallback/LLM).
  9. Data feedback loop for active dataset improvement.
  10. Reproducibility metadata in model meta and bundle manifest.

Repository Layout

intent-nlu/
  cmd/
    intent-nlu-train/            # train one language model
    intent-nlu-predict/          # predict by single model / model map / bundle
    intent-nlu-bundle/           # build multilingual bundle from trained models
    intent-nlu-feedback/         # feedback ingestion and dataset/review update
  dataset/chatterbot/          # chatterbot corpus loader
  datasets/
    default/
      zh_business.csv          # business intents (calendar, weather, greeting)
      en_business.csv
      zh_skill_routing.csv     # skill routing intents (creative, analysis, chat)
      en_skill_routing.csv
      zh_tools_routing.csv     # tool routing intents (search, code, tasks, files, etc.)
      en_tools_routing.csv
      zh_tools_boost.csv       # supplemental short-phrase samples for tool intents
      en_tools_boost.csv
      zh_tools_boost2.csv      # targeted samples for weak intents (debug, analyze, search)
      en_tools_boost2.csv
    generated/
      *_train.csv              # effective training samples
      *_file_map.yaml          # auto-generated chatterbot mappings
      eval/*.json              # evaluation/model meta snapshots
    feedback/
      review/                  # low-confidence/unknown queue
      archive/                 # optional archived feedback
  docs/
    architecture.md
    skill-routing-integration.md
  examples/
    file_intent_map.yaml
  models/
    model-zh/
    model-en/
    multilingual/
  scripts/
    train_chatterbot_models.sh
    feedback_loop.sh

  # core package
  types.go
  tokenizer.go
  language.go
  taxonomy.go
  evaluation.go
  trainer.go
  model.go
  engine.go
  router.go
  router_bundle.go
  hybrid_policy.go
  embedded_bundle.go

Quick Start

1) Run tests
make test
2) Reproducible local evaluation (make eval)
make eval

Outputs:

  • datasets/generated/eval/zh_eval.json
  • datasets/generated/eval/en_eval.json

Both reports include split metrics (accuracy, macro-F1, micro-F1, confusion matrix, per-intent metrics) and training metadata.
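
For readers unfamiliar with the two averaged scores, the sketch below shows how macro-F1 and micro-F1 can be derived from per-intent TP/FP/FN counts. The `counts` type and the function names are hypothetical and are not part of the package API; the package's own evaluation code may differ in details such as how empty classes are handled.

```go
package main

import "fmt"

// counts holds per-intent confusion counts (hypothetical type for this sketch).
type counts struct{ TP, FP, FN int }

// f1 returns the harmonic mean of precision and recall, or 0 when undefined.
func f1(c counts) float64 {
	if 2*c.TP+c.FP+c.FN == 0 {
		return 0
	}
	return float64(2*c.TP) / float64(2*c.TP+c.FP+c.FN)
}

// macroF1 averages per-intent F1, so rare intents weigh as much as common ones.
func macroF1(per map[string]counts) float64 {
	if len(per) == 0 {
		return 0
	}
	sum := 0.0
	for _, c := range per {
		sum += f1(c)
	}
	return sum / float64(len(per))
}

// microF1 pools the counts across intents before computing F1.
func microF1(per map[string]counts) float64 {
	var total counts
	for _, c := range per {
		total.TP += c.TP
		total.FP += c.FP
		total.FN += c.FN
	}
	return f1(total)
}

func main() {
	per := map[string]counts{
		"web_search":    {TP: 8, FP: 2, FN: 2},
		"coding_assist": {TP: 1, FP: 0, FN: 3},
	}
	fmt.Printf("macro-F1=%.3f micro-F1=%.3f\n", macroF1(per), microF1(per))
}
```

A large gap between the two scores usually means a few small intents are performing much worse than the large ones, which is worth checking in the confusion matrix.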

3) One-click corpus training (zh,en default)
make train

Outputs:

  1. models/model-zh, models/model-en
  2. models/multilingual
  3. datasets/generated/{zh,en}_train.csv
  4. datasets/generated/eval/{zh,en}_meta.json

Intent Classes

Default embedded models include business, skill routing, tool routing, and chitchat intents:

  • zh: 20 canonical classes
  • en: 20 canonical classes
  • fallback: unknown

Skill Routing Intents

| Intent | Description | ZH Samples | EN Samples |
| --- | --- | --- | --- |
| creative_video | Video production, editing, TVC, ads | 200 | 200 |
| creative_image | Image generation, poster, illustration | 200 | 203 |
| creative_audio | Music, sound effects, audio production | 100 | 100 |
| creative_3d | 3D modeling, rendering | 80 | 80 |
| media_analysis | Video/image understanding, description | 100 | 100 |
| general_chat | Chitchat, questions, non-creative tasks | 300 | 300 |

Tool Routing Intents

Maps user intents to saker built-in tools and operational skills:

| Intent | Description | Saker Tools | ZH F1 | EN F1 |
| --- | --- | --- | --- | --- |
| web_search | Search the web, look up docs, find info | web_search, web_fetch, browser | 0.57 | 0.61 |
| coding_assist | Write/debug/run code, fix bugs, tests | bash, edit, read, write, grep, glob | 0.69 | 0.72 |
| task_management | Create/list/update tasks, kanban board | task_, kanban_ | 0.90 | 0.93 |
| file_operation | Download, read, write, save files | fetch_file, read, write | 0.88 | 1.00 |
| knowledge_qa | Recall past decisions, remember context | memory_read, recall_context | 0.96 | 0.88 |
| workflow_automation | Schedule cron jobs, automate pipelines | workflow, cron, loop | 0.53 | 0.67 |
| data_analysis | Analyze metrics, compute stats, find patterns | bash, canvas_table_write | 0.78 | 0.88 |
| document_creation | Write docs, README, guides, reports | canvas_create_node, write | 0.71 | 0.89 |
| translation | Translate text between languages | - (common user need) | 0.86 | 0.93 |
| summarization | Summarize logs, reports, discussions | video_summarizer | 0.67 | 0.80 |

Business Intents

| Intent | Description | zh | en |
| --- | --- | --- | --- |
| calendar_info | Date, weekday, holiday, lunar/solar calendar queries | Yes | Yes |
| weather_info | Weather forecast, rain, temperature | Yes | Yes |
| chitchat_greeting | Direct short greetings (hi/hello/good morning) | Yes | Yes |

Chitchat Intents

| Intent | Description | zh | en |
| --- | --- | --- | --- |
| chitchat_greetings | Greeting variants from chatterbot corpus | Yes | Yes |
| chitchat_ai | AI/assistant topic small talk | Yes | Yes |
| chitchat_botprofile | Bot identity/capabilities/preferences | Yes | Yes |
| chitchat_conversations | Generic open-domain conversation | Yes | Yes |
| chitchat_emotion | Emotion/mood/support style casual talk | Yes | Yes |
| chitchat_food | Food and drink discussion | Yes | Yes |
| chitchat_gossip | Gossip/celebrity/light social topics | Yes | Yes |
| chitchat_history | History trivia chat | Yes | Yes |
| chitchat_humor | Jokes/funny content | Yes | Yes |
| chitchat_literature | Books/writing/literature chat | Yes | Yes |
| chitchat_money | Money/finance-light casual talk | Yes | Yes |
| chitchat_movies | Movies/entertainment chat | Yes | Yes |
| chitchat_politics | Politics social chat | Yes | Yes |
| chitchat_psychology | Psychology/personality casual topics | Yes | Yes |
| chitchat_science | Science trivia chat | Yes | Yes |
| chitchat_sports | Sports chat | Yes | Yes |
| chitchat_trivia | General trivia/knowledge snippets | Yes | Yes |
| chitchat_coding | Programming/dev casual topics | No | Yes |
| chitchat_computers | Computer/device/software casual topics | No | Yes |
| chitchat_health | Health/wellness casual topics | No | Yes |
| chitchat_tech_support | Light technical support style chat | No | Yes |

Taxonomy Aliases

Applied at inference time via NormalizeIntent():

// Creative routing
"video_production"  -> "creative_video"
"video_editing"     -> "creative_video"
"film_production"   -> "creative_video"
"ad_production"     -> "creative_video"
"image_generation"  -> "creative_image"
"poster_design"     -> "creative_image"
"music_creation"    -> "creative_audio"
"audio_production"  -> "creative_audio"
"3d_modeling"       -> "creative_3d"
"video_analysis"    -> "media_analysis"
"image_analysis"    -> "media_analysis"

// Tool routing
"search"            -> "web_search"
"internet_search"   -> "web_search"
"lookup"            -> "web_search"
"google"            -> "web_search"
"code"              -> "coding_assist"
"programming"       -> "coding_assist"
"debug"             -> "coding_assist"
"fix_code"          -> "coding_assist"
"write_code"        -> "coding_assist"
"task"              -> "task_management"
"todo"              -> "task_management"
"kanban"            -> "task_management"
"download"          -> "file_operation"
"upload"            -> "file_operation"
"read_file"         -> "file_operation"
"save_file"         -> "file_operation"
"recall"            -> "knowledge_qa"
"remember"          -> "knowledge_qa"
"schedule"          -> "workflow_automation"
"cron"              -> "workflow_automation"
"automate"          -> "workflow_automation"
"analyze_data"      -> "data_analysis"
"statistics"        -> "data_analysis"
"metrics"           -> "data_analysis"
"create_doc"        -> "document_creation"
"write_doc"         -> "document_creation"
"documentation"     -> "document_creation"
"translate"         -> "translation"
"localize"          -> "translation"
"summarize"         -> "summarization"
"tldr"              -> "summarization"
"digest"            -> "summarization"
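
The alias table above is a plain lookup from alias to canonical label. As a rough illustration, the sketch below applies such a map with the standard library only. The `normalize` helper is hypothetical, and its trimming and lowercasing are assumptions of this sketch; the package's `NormalizeIntent` may treat case and whitespace differently.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize maps an alias to its canonical intent; unknown labels pass through.
// Trimming and lowercasing are assumptions of this sketch.
func normalize(intent string, aliases map[string]string) string {
	key := strings.ToLower(strings.TrimSpace(intent))
	if canonical, ok := aliases[key]; ok {
		return canonical
	}
	return key
}

func main() {
	aliases := map[string]string{
		"video_production": "creative_video",
		"google":           "web_search",
		"tldr":             "summarization",
	}
	for _, in := range []string{"video_production", " Google ", "weather_info"} {
		fmt.Printf("%q -> %q\n", in, normalize(in, aliases))
	}
}
```

In real code, prefer `intentnlu.NormalizeIntent(intent, intentnlu.DefaultIntentAliases())` so the alias set stays in sync with the trained models.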

Supported Languages

| Language | Code | Default Embedded Model | Auto Detect | Notes |
| --- | --- | --- | --- | --- |
| Chinese | zh | Yes | Yes | gse tokenizer, 20 canonical classes |
| English | en | Yes | Yes | normalized tokenizer, 20 canonical classes |
| Japanese | ja | No (train yourself) | Yes | language detection supported |
| Korean | ko | No (train yourself) | Yes | language detection supported |

Routing Modes and Thresholds

| Mode | Suggested Settings | Behavior |
| --- | --- | --- |
| Candidate handoff to LLM/tool planner | CandidateMode: true, TopK: 3-5 | Prioritizes recall; possible tools remain visible for final LLM selection |
| Direct tool execution | 0.75 - 0.85 threshold | Prioritizes precision; only high-confidence intents execute without LLM review |
| Business routing (balanced) | 0.60 - 0.70 threshold | Good default for deterministic intent dispatch |
| Default baseline | 0.55 threshold | Current training default if no per-intent override |
| Uncertain direct route | < threshold => unknown | Keep Candidates; route to fallback or LLM |

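
To make the threshold rows concrete, here is a minimal sketch of direct routing with a default threshold, per-intent overrides, and ranked candidates that survive rejection. The `route` function and `candidate` type are hypothetical stand-ins written against the standard library; they mirror the behavior described in the table, not the package's internal code.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is one ranked hypothesis (hypothetical type for this sketch).
type candidate struct {
	Intent string
	Score  float64
}

// route returns the intent to dispatch plus the top-k ranked candidates.
// perIntent overrides defaultThreshold for specific intents.
func route(scores, perIntent map[string]float64, defaultThreshold float64, topK int) (string, []candidate) {
	ranked := make([]candidate, 0, len(scores))
	for intent, s := range scores {
		ranked = append(ranked, candidate{intent, s})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Score != ranked[j].Score {
			return ranked[i].Score > ranked[j].Score
		}
		return ranked[i].Intent < ranked[j].Intent // deterministic tie-break
	})
	if topK > 0 && len(ranked) > topK {
		ranked = ranked[:topK]
	}
	if len(ranked) == 0 {
		return "unknown", ranked
	}
	threshold := defaultThreshold
	if t, ok := perIntent[ranked[0].Intent]; ok {
		threshold = t
	}
	if ranked[0].Score < threshold {
		return "unknown", ranked // candidates stay available for an LLM
	}
	return ranked[0].Intent, ranked
}

func main() {
	scores := map[string]float64{"web_search": 0.62, "summarization": 0.25, "general_chat": 0.13}
	// A stricter per-intent threshold rejects web_search even though it beats 0.55.
	intent, cands := route(scores, map[string]float64{"web_search": 0.80}, 0.55, 2)
	fmt.Println(intent, cands)
}
```

The point of returning candidates alongside `unknown` is that a downstream LLM or tool planner can still see the plausible options after a strict threshold rejects the top intent.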
Operational Notes

| Topic | Risk | Recommendation |
| --- | --- | --- |
| Fine-grained classes | Confusion across similar chitchat intents | Keep enough per-intent samples and evaluate confusion matrix |
| Corpus bias | chatterbot data is mostly chitchat | Always mix business/skill routing CSV for production tasks |
| Multilingual routing | Short/mixed text can route wrong language | Use language hint for critical paths |
| Threshold drift | Retraining changes confidence distribution | Re-calibrate thresholds and compare eval/*.json every release |
| Tool candidate loss | High thresholds can hide plausible tools from LLM planners | Use CandidateMode and monitor TopK candidate recall |
| Embedded bundle updates | New models require dependency rebuild | Pin model version and release notes with each update |

Training Workflows

A) One-click training script

./scripts/train_chatterbot_models.sh \
  --langs zh,en \
  --threshold 0.55 \
  --split-enabled true \
  --train-ratio 0.8 \
  --val-ratio 0.1 \
  --test-ratio 0.1 \
  --seed 42 \
  --auto-calibrate true \
  --merge-bundle true \
  --bundle-dir ./models/multilingual

What it does:

  1. Clone/update chatterbot-corpus.
  2. Auto-generate file->intent mapping (chitchat_<file>).
  3. Auto-discover and merge all datasets/default/<lang>_*.csv files (business, skill routing, tool routing, boost data).
  4. Train models with split/evaluation/calibration.
  5. Build multilingual bundle via cmd/intent-nlu-bundle.
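
Step 4 relies on a deterministic split controlled by `--seed` and the three ratios. A minimal sketch of a seeded split is shown below, assuming a plain shuffle of sample indices; whether the trainer also stratifies by intent is not stated here, and `split` is a hypothetical helper, not a package function.

```go
package main

import (
	"fmt"
	"math/rand"
)

// split shuffles indices with a seeded generator and cuts them by ratio.
// The same seed and sample count always produce the same partition.
// It assumes trainRatio+valRatio <= 1; the remainder becomes the test split.
func split(n int, trainRatio, valRatio float64, seed int64) (train, val, test []int) {
	idx := rand.New(rand.NewSource(seed)).Perm(n)
	nTrain := int(float64(n) * trainRatio)
	nVal := int(float64(n) * valRatio)
	return idx[:nTrain], idx[nTrain : nTrain+nVal], idx[nTrain+nVal:]
}

func main() {
	train, val, test := split(100, 0.8, 0.1, 42)
	fmt.Println(len(train), len(val), len(test))
}
```

Because the shuffle depends only on the seed and the sample count, rerunning with the same inputs reproduces the same train/val/test membership, which is what makes the eval reports comparable across runs.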
B) Manual training CLI
GOWORK=off go run ./cmd/intent-nlu-train \
  -lang zh \
  -corpus-root /path/to/chatterbot_corpus/data/chinese \
  -file-map ./examples/file_intent_map.yaml \
  -extra-csv ./datasets/default/zh_business.csv,./datasets/default/zh_skill_routing.csv \
  -dump-samples ./datasets/generated/zh_train.csv \
  -eval-report ./datasets/generated/eval/zh_meta.json \
  -out ./models/model-zh \
  -version 2026.05.31.zh.1 \
  -threshold 0.55 \
  -split-enabled=true \
  -train-ratio 0.8 \
  -val-ratio 0.1 \
  -test-ratio 0.1 \
  -seed 42 \
  -auto-calibrate-thresholds=true

Important flags:

  • Data: -corpus-root, -file-map, -category-map, -extra-csv (comma-separated multi-file)
  • Split/eval: -split-enabled, -train-ratio, -val-ratio, -test-ratio, -seed, -eval-report
  • Threshold: -threshold, -thresholds, -auto-calibrate-thresholds
  • Taxonomy: -disable-taxonomy (default true), -taxonomy-aliases
  • Reproducibility source: -source-name, -source-version, -source-revision, -source-repo-url, -source-commit

Bundle Build CLI

GOWORK=off go run ./cmd/intent-nlu-bundle \
  -bundle-dir ./models/multilingual \
  -models "zh=./models/model-zh,en=./models/model-en" \
  -default-lang zh \
  -version 2026.05.31.bundle.1 \
  -corpus-repo-url https://github.com/gunthercox/chatterbot-corpus.git \
  -corpus-commit <commit> \
  -training-params "seed=42,train_ratio=0.8,val_ratio=0.1,test_ratio=0.1"

Prediction

Single model
GOWORK=off go run ./cmd/intent-nlu-predict \
  -model ./models/model-zh \
  -text "帮我做个产品视频" \
  -lang auto \
  -topk 3

Recall-first candidate output for LLM/tool-planner selection:

GOWORK=off go run ./cmd/intent-nlu-predict \
  -bundle ./models/multilingual \
  -text "maybe look this up and summarize it" \
  -lang auto \
  -topk 5 \
  -candidate-mode
Multi-model map
GOWORK=off go run ./cmd/intent-nlu-predict \
  -models "zh=./models/model-zh,en=./models/model-en" \
  -text "create a poster" \
  -lang auto
Bundle
GOWORK=off go run ./cmd/intent-nlu-predict \
  -bundle ./models/multilingual \
  -text "做一个3D模型" \
  -lang auto
No model flags (use embedded default bundle)
GOWORK=off go run ./cmd/intent-nlu-predict \
  -text "analyze this video" \
  -lang auto

If -bundle, -models, and -model are all omitted, the command loads embedded default models.

Embedded Bundle (for dependency consumers)

intent-nlu embeds default multilingual models into the package. When another Go service imports this module, it can load models without shipping external files.

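package main

import (
    "context"
    "fmt"

    intentnlu "github.com/godeps/intent-nlu"
)

func main() {
    router, err := intentnlu.NewRouterFromEmbedded()
    if err != nil {
        panic(err)
    }
    pred, err := router.Predict(context.Background(), "帮我画一张海报", intentnlu.PredictOptions{
        TopK:         3,
        LanguageHint: "zh",
    })
    if err != nil {
        panic(err)
    }
    // Expect pred.Intent == "creative_image"; the confidence (about 0.96 here) varies by model version.
    fmt.Println(pred.Intent, pred.Confidence)
}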

Optional custom extraction cache directory:

router, err := intentnlu.NewRouterFromEmbeddedIn("./.cache/intent-nlu")

Feedback Loop

Use a model feedback CSV to:

  1. Append human-labeled rows to datasets/default/<lang>_business.csv.
  2. Put low-confidence/unknown rows into the review queue.

./scripts/feedback_loop.sh --input ./tmp/feedback.csv

Supported CSV headers:

  • required: text
  • optional aliases:
    • language: language
    • predicted intent: pred_intent|intent|predicted_intent
    • score: confidence|score
    • human label: final_intent|human_intent|label
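
When preparing a feedback file, it can help to check that its header row will be recognized. The sketch below resolves the header aliases listed above to column indexes using `encoding/csv`. The `resolveColumns` helper, the logical column names, and the rule that the first listed alias wins are assumptions of this sketch, not the behavior of `intent-nlu-feedback`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// headerAliases lists the accepted header names per logical column,
// following the list above. The logical names are hypothetical.
var headerAliases = map[string][]string{
	"text":       {"text"},
	"language":   {"language"},
	"predicted":  {"pred_intent", "intent", "predicted_intent"},
	"score":      {"confidence", "score"},
	"humanLabel": {"final_intent", "human_intent", "label"},
}

// resolveColumns maps each logical column to its index in the header row.
// It fails when the required text column is missing.
func resolveColumns(header []string) (map[string]int, error) {
	pos := make(map[string]int, len(header))
	for i, h := range header {
		pos[strings.ToLower(strings.TrimSpace(h))] = i
	}
	cols := map[string]int{}
	for logical, names := range headerAliases {
		for _, name := range names {
			if i, ok := pos[name]; ok {
				cols[logical] = i
				break // first listed alias wins (assumption)
			}
		}
	}
	if _, ok := cols["text"]; !ok {
		return nil, fmt.Errorf("feedback CSV: required column %q not found", "text")
	}
	return cols, nil
}

func main() {
	data := "text,score,label\nbook a meeting,0.41,calendar_info\n"
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		panic(err)
	}
	cols, err := resolveColumns(rows[0])
	if err != nil {
		panic(err)
	}
	fmt.Println(rows[1][cols["text"]], rows[1][cols["humanLabel"]])
}
```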

Package Usage

Engine
engine, err := intentnlu.NewEngineFromDir("./models/model-zh")
if err != nil {
    panic(err)
}

pred, err := engine.Predict(context.Background(), "做一段背景音乐", intentnlu.PredictOptions{
    TopK:         3,
    LanguageHint: "auto",
})
// pred.Intent == "creative_audio"
Router
router, err := intentnlu.NewRouterFromBundle("./models/multilingual")
if err != nil {
    panic(err)
}

pred, err := router.Predict(context.Background(), "create a 3D model", intentnlu.PredictOptions{
    TopK:         3,
    LanguageHint: "auto",
})
// pred.Intent == "creative_3d"

For LLM/tool-planner handoff, prefer recall-first candidates:

pred, err := router.Predict(context.Background(), userText, intentnlu.PredictOptions{
    TopK:          5,
    LanguageHint:  "auto",
    CandidateMode: true,
})
// pred.Candidates contains the ranked possible tools/intents even when a
// direct-routing threshold would have rejected the top intent.
Hybrid Policy (rules -> NLU -> candidate/fallback)
policy := &intentnlu.HybridPolicy{
    Router: router,
    Rules: []intentnlu.DeterministicRule{
        {ID: "r1", Intent: "video_production", ContainsAny: []string{"tvc", "宣传片制作"}},
    },
}
// Prepare validates rules and compiles their regexes; taxonomy normalizes "video_production" -> "creative_video".
if err := policy.Prepare(); err != nil {
    panic(err)
}

decision, err := policy.Decide(context.Background(), userText, intentnlu.PredictOptions{
    TopK:          5,
    CandidateMode: true,
})
// decision.Route: rule | nlu | candidate | fallback
// decision.ShouldCallLLM tells whether to continue into LLM
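
The rules -> NLU -> fallback order can be pictured with a small standalone sketch. Everything below (`rule`, `decision`, `classify`, `decide`) is hypothetical and built on the standard library. It assumes case-insensitive substring rules and that only the fallback route asks for an LLM, neither of which is confirmed by the package documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a simplified deterministic rule with substring matching only.
type rule struct {
	ID, Intent  string
	ContainsAny []string
}

// decision is a simplified routing result.
type decision struct {
	Route, Intent string
	ShouldCallLLM bool
}

// classify stands in for the NLU model; it returns an intent and a confidence.
type classify func(text string) (string, float64)

// decide tries rules first, then the classifier, then falls back.
func decide(text string, rules []rule, nlu classify, threshold float64) decision {
	lower := strings.ToLower(text)
	for _, r := range rules {
		for _, kw := range r.ContainsAny {
			if strings.Contains(lower, strings.ToLower(kw)) {
				return decision{Route: "rule", Intent: r.Intent}
			}
		}
	}
	if intent, conf := nlu(text); conf >= threshold {
		return decision{Route: "nlu", Intent: intent}
	}
	return decision{Route: "fallback", Intent: "unknown", ShouldCallLLM: true}
}

func main() {
	rules := []rule{{ID: "r1", Intent: "creative_video", ContainsAny: []string{"tvc", "宣传片制作"}}}
	// The stub always answers with a low-confidence guess.
	stub := func(text string) (string, float64) { return "general_chat", 0.42 }
	fmt.Printf("%+v\n", decide("Need a TVC for our launch", rules, stub, 0.55))
	fmt.Printf("%+v\n", decide("hmm, not sure", rules, stub, 0.55))
}
```

Putting cheap deterministic rules first keeps high-certainty phrases out of the classifier entirely, and the fallback branch is the only place where the more expensive LLM call is requested.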

Notes

  1. chatterbot-corpus is mostly chitchat; business and skill routing intents need curated data.
  2. Multilingual bundle is a packaging format, not one fused multilingual classifier.
  3. For LLM final selection, optimize TopK candidate recall before single-label precision.
  4. Generated artifacts can grow quickly; plan storage strategy by environment.

Commands Summary

make test                                  # run all tests
make eval                                  # reproducible evaluation (CSV only)
make train                                 # full training + bundling
./scripts/feedback_loop.sh --input <csv>   # feedback data ingestion

Documentation

Constants

const (
	HybridRouteRule      = "rule"
	HybridRouteNLU       = "nlu"
	HybridRouteCandidate = "candidate"
	HybridRouteFallback  = "fallback"
)
const (
	// DefaultUnknownIntent is returned when confidence is below threshold.
	DefaultUnknownIntent = "unknown"
	// ModelBinaryFile is the bayesian model artifact file name.
	ModelBinaryFile = "model.gob"
	// ModelMetaFile is the metadata artifact file name.
	ModelMetaFile = "meta.json"
)
const (
	// BundleManifestFileName is the default file name of multilingual router bundle manifest.
	BundleManifestFileName = "manifest.json"
)

Variables

This section is empty.

Functions

func DefaultIntentAliases

func DefaultIntentAliases() map[string]string

DefaultIntentAliases returns a copy of the stable intent taxonomy aliases.

func ExtractEmbeddedBundle

func ExtractEmbeddedBundle(cacheDir string) (string, error)

ExtractEmbeddedBundle extracts embedded multilingual bundle to local filesystem and returns bundle directory. Extraction is idempotent for the same cacheDir and embedded manifest hash.

func NormalizeIntent

func NormalizeIntent(intent string, aliases map[string]string) string

NormalizeIntent normalizes one intent to canonical taxonomy label.

func NormalizeThresholds

func NormalizeThresholds(thresholds map[string]float64, aliases map[string]string) map[string]float64

NormalizeThresholds canonicalizes threshold keys with taxonomy aliases.

func SaveBundleManifest

func SaveBundleManifest(bundleDir string, manifest BundleManifest) error

SaveBundleManifest writes bundle manifest into bundle directory.

func SaveSamplesCSV

func SaveSamplesCSV(path string, samples []Sample) error

SaveSamplesCSV writes labeled samples into CSV with header text,intent.

Types

type BundleManifest

type BundleManifest struct {
	Version         string                        `json:"version"`
	CreatedAt       time.Time                     `json:"createdAt"`
	DefaultLanguage string                        `json:"defaultLanguage"`
	Corpus          SourceMetadata                `json:"corpus,omitempty"`
	TrainingParams  map[string]string             `json:"trainingParams,omitempty"`
	ModelSummary    map[string]BundleModelSummary `json:"modelSummary,omitempty"`
	Models          map[string]string             `json:"models"` // lang -> relative model directory
}

BundleManifest describes one multilingual model bundle.

func EmbeddedBundleManifest

func EmbeddedBundleManifest() (BundleManifest, error)

EmbeddedBundleManifest reads bundle manifest from embedded assets.

func LoadBundleManifest

func LoadBundleManifest(bundleDir string) (BundleManifest, error)

LoadBundleManifest reads manifest.json from a bundle directory.

type BundleModelSummary

type BundleModelSummary struct {
	Version             string  `json:"version,omitempty"`
	Language            string  `json:"language,omitempty"`
	TrainingSampleCount int     `json:"trainingSampleCount,omitempty"`
	TotalSampleCount    int     `json:"totalSampleCount,omitempty"`
	DefaultThreshold    float64 `json:"defaultThreshold,omitempty"`
	MacroF1             float64 `json:"macroF1,omitempty"`
}

BundleModelSummary stores high-level model metadata in bundle manifest.

type Candidate

type Candidate struct {
	Intent string  `json:"intent"`
	Score  float64 `json:"score"`
}

Candidate is one ranked prediction candidate.

type ClassMetrics

type ClassMetrics struct {
	Precision       float64 `json:"precision"`
	Recall          float64 `json:"recall"`
	F1              float64 `json:"f1"`
	Top1Recall      float64 `json:"top1Recall,omitempty"`
	Top3Recall      float64 `json:"top3Recall,omitempty"`
	Top5Recall      float64 `json:"top5Recall,omitempty"`
	Support         int     `json:"support"`
	TP              int     `json:"tp"`
	FP              int     `json:"fp"`
	FN              int     `json:"fn"`
	Top1CandidateTP int     `json:"top1CandidateTp,omitempty"`
	Top3CandidateTP int     `json:"top3CandidateTp,omitempty"`
	Top5CandidateTP int     `json:"top5CandidateTp,omitempty"`
}

ClassMetrics describes one intent evaluation result.

type DatasetSplitConfig

type DatasetSplitConfig struct {
	Enabled    bool    `json:"enabled"`
	TrainRatio float64 `json:"trainRatio"`
	ValRatio   float64 `json:"valRatio"`
	TestRatio  float64 `json:"testRatio"`
	Seed       int64   `json:"seed"`
}

DatasetSplitConfig controls deterministic train/val/test split.

func DefaultDatasetSplitConfig

func DefaultDatasetSplitConfig() DatasetSplitConfig

DefaultDatasetSplitConfig returns default split config.

type DeterministicRule

type DeterministicRule struct {
	ID          string   `json:"id"`
	Intent      string   `json:"intent"`
	Language    string   `json:"language,omitempty"`
	EqualsAny   []string `json:"equalsAny,omitempty"`
	PrefixAny   []string `json:"prefixAny,omitempty"`
	ContainsAny []string `json:"containsAny,omitempty"`
	Regex       string   `json:"regex,omitempty"`
	// contains filtered or unexported fields
}

DeterministicRule defines one pre-NLU deterministic route rule.

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine provides concurrent-safe prediction and hot reload.

func NewEngineFromDir

func NewEngineFromDir(modelDir string) (*Engine, error)

NewEngineFromDir loads model artifacts and creates a prediction engine.

func (*Engine) Language

func (e *Engine) Language() string

Language returns current model language.

func (*Engine) Meta added in v0.3.0

func (e *Engine) Meta() ModelMeta

Meta returns a deep copy of the current model metadata.

func (*Engine) Predict

func (e *Engine) Predict(_ context.Context, text string, opts PredictOptions) (Prediction, error)

Predict runs intent prediction.

func (*Engine) Reload

func (e *Engine) Reload(modelDir string) error

Reload atomically reloads model artifacts from the given directory.
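
A typical way to get an atomic reload in Go is to keep the loaded model behind an atomic pointer and swap it only after the new artifacts load successfully. The sketch below shows that pattern with `sync/atomic` (Go 1.19+). The `engine` and `model` types are hypothetical, and the real Engine may use a different mechanism such as a mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errCorrupt simulates a failed artifact load.
var errCorrupt = errors.New("corrupt model artifacts")

// model is a hypothetical immutable snapshot of loaded artifacts.
type model struct{ version string }

// engine serves predictions from the current snapshot.
type engine struct {
	current atomic.Pointer[model]
}

// reload builds the new snapshot first and swaps it in with one atomic store,
// so concurrent readers see either the old model or the new one, never a mix.
func (e *engine) reload(load func() (*model, error)) error {
	m, err := load()
	if err != nil {
		return err // the previous snapshot keeps serving
	}
	e.current.Store(m)
	return nil
}

// version reads the snapshot without locking.
func (e *engine) version() string {
	if m := e.current.Load(); m != nil {
		return m.version
	}
	return ""
}

func main() {
	e := &engine{}
	_ = e.reload(func() (*model, error) { return &model{version: "2026.05.31.zh.1"}, nil })
	err := e.reload(func() (*model, error) { return nil, errCorrupt })
	fmt.Println(e.version(), err)
}
```

The useful property is that a failed reload leaves the previous model in place, so a bad deployment of model files does not take prediction offline.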

func (*Engine) Version

func (e *Engine) Version() string

Version returns current model version.

type EvalReport

type EvalReport struct {
	Split       string                    `json:"split"`
	Samples     int                       `json:"samples"`
	Accuracy    float64                   `json:"accuracy"`
	MacroF1     float64                   `json:"macroF1"`
	MicroF1     float64                   `json:"microF1"`
	UnknownRate float64                   `json:"unknownRate"`
	PerIntent   map[string]ClassMetrics   `json:"perIntent"`
	Confusion   map[string]map[string]int `json:"confusion"`
}

EvalReport describes evaluation metrics for one split.

type HybridDecision

type HybridDecision struct {
	Route         string     `json:"route"`
	Intent        string     `json:"intent,omitempty"`
	RuleID        string     `json:"ruleId,omitempty"`
	Prediction    Prediction `json:"prediction"`
	ShouldCallLLM bool       `json:"shouldCallLLM"`
}

HybridDecision describes final routing decision.

type HybridPolicy

type HybridPolicy struct {
	Rules         []DeterministicRule
	Router        *Router
	Engine        *Engine
	UnknownIntent string
}

HybridPolicy combines deterministic rules + NLU + fallback. With PredictOptions.CandidateMode, NLU is used as a high-recall candidate generator and the final decision is left to the downstream LLM/tool planner.

func (*HybridPolicy) Decide

func (p *HybridPolicy) Decide(ctx context.Context, text string, opts PredictOptions) (HybridDecision, error)

Decide applies deterministic rules first, then NLU, then fallback.

func (*HybridPolicy) Prepare

func (p *HybridPolicy) Prepare() error

Prepare validates and compiles regex for hybrid rules.

type IntentDataSummary

type IntentDataSummary struct {
	Total int `json:"total"`
	Train int `json:"train"`
	Val   int `json:"val"`
	Test  int `json:"test"`
}

IntentDataSummary stores per-intent sample counts by split.

type Language

type Language string

Language identifies tokenizer/model language.

const (
	LanguageAuto Language = "auto"
	LanguageZH   Language = "zh"
	LanguageEN   Language = "en"
	LanguageJA   Language = "ja"
	LanguageKO   Language = "ko"
)

func DetectLanguage

func DetectLanguage(text string) Language

DetectLanguage returns a lightweight language guess for routing.

type LanguageDetection

type LanguageDetection struct {
	Language    Language
	Confidence  float64
	Reason      string
	LetterCount int
	ShortText   bool
}

LanguageDetection stores language detection result.

func DetectLanguageDetailed

func DetectLanguageDetailed(text string) LanguageDetection

DetectLanguageDetailed returns language detection with confidence and reason.
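
A lightweight language guess of this kind is often based on Unicode script counts. The sketch below illustrates the idea with the standard `unicode` package. `guessLanguage` is a hypothetical function, and its rules (kana implies Japanese, otherwise the dominant script wins) are assumptions rather than the logic of `DetectLanguage`.

```go
package main

import (
	"fmt"
	"unicode"
)

// guessLanguage counts letters per script and returns a language code with
// the share of counted letters that support it. It returns "auto" when the
// text contains no letters from the scripts it knows.
func guessLanguage(text string) (string, float64) {
	counts := map[string]int{}
	total := 0
	for _, r := range text {
		switch {
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			counts["ja"]++
		case unicode.Is(unicode.Hangul, r):
			counts["ko"]++
		case unicode.Is(unicode.Han, r):
			counts["zh"]++
		case unicode.Is(unicode.Latin, r):
			counts["en"]++
		default:
			continue // digits, punctuation, and other scripts are ignored
		}
		total++
	}
	if total == 0 {
		return "auto", 0
	}
	// Kana marks Japanese even when Han characters dominate the sentence.
	if counts["ja"] > 0 {
		return "ja", float64(counts["ja"]+counts["zh"]) / float64(total)
	}
	best, bestN := "", 0
	for _, lang := range []string{"zh", "ko", "en"} {
		if counts[lang] > bestN {
			best, bestN = lang, counts[lang]
		}
	}
	return best, float64(bestN) / float64(total)
}

func main() {
	for _, s := range []string{"帮我画一张海报", "create a poster", "ポスターを作って", "포스터 만들어줘"} {
		lang, share := guessLanguage(s)
		fmt.Printf("%s -> %s (%.2f)\n", s, lang, share)
	}
}
```

Script counting is fast but weak on short or mixed text, which is why the README recommends passing a language hint on critical paths.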

type ModelMeta

type ModelMeta struct {
	Version             string                `json:"version"`
	Language            string                `json:"language,omitempty"`
	UnknownIntent       string                `json:"unknownIntent"`
	DefaultThreshold    float64               `json:"defaultThreshold"`
	Thresholds          map[string]float64    `json:"thresholds,omitempty"`
	Classes             []string              `json:"classes"`
	CanonicalIntents    []string              `json:"canonicalIntents,omitempty"`
	IntentAliases       map[string]string     `json:"intentAliases,omitempty"`
	Tokenizer           TokenizerConfig       `json:"tokenizer"`
	TrainingSampleCount int                   `json:"trainingSampleCount"`
	CreatedAt           time.Time             `json:"createdAt"`
	Evaluation          map[string]EvalReport `json:"evaluation,omitempty"`
	Training            TrainingMetadata      `json:"training"`
	Source              SourceMetadata        `json:"source,omitempty"`
}

ModelMeta stores model metadata and inference policy.

type PredictOptions

type PredictOptions struct {
	TopK            int
	LanguageHint    string
	MinConfidence   float64 // if > 0, override model threshold for direct routing
	IgnoreThreshold bool
	// CandidateMode favors recall for LLM/tool-planner handoff. It keeps the
	// best intent accepted regardless of threshold while still returning TopK
	// candidates for downstream final selection.
	CandidateMode bool
}

PredictOptions controls prediction behavior.

type Prediction

type Prediction struct {
	Intent      string      `json:"intent"`
	Language    string      `json:"language,omitempty"`
	Confidence  float64     `json:"confidence"`
	Strict      bool        `json:"strict"`
	Matched     bool        `json:"matched"`
	Reason      string      `json:"reason,omitempty"`
	Tokens      []string    `json:"tokens,omitempty"`
	Candidates  []Candidate `json:"candidates,omitempty"`
	ModelVer    string      `json:"modelVersion,omitempty"`
	UnknownUsed bool        `json:"unknownUsed"`
}

Prediction is one inference result. Candidates are always the raw ranked intent hypotheses before threshold rejection, so callers can pass them to an LLM/tool planner even when Intent is unknown.

type Router

type Router struct {
	// contains filtered or unexported fields
}

Router routes inputs to language-specific intent engines.

func NewRouter

func NewRouter(defaultLang string) *Router

NewRouter creates an empty router.

func NewRouterFromBundle

func NewRouterFromBundle(bundleDir string) (*Router, error)

NewRouterFromBundle loads multilingual models from a bundle directory.

func NewRouterFromDirs

func NewRouterFromDirs(modelByLanguage map[string]string, defaultLang string) (*Router, error)

NewRouterFromDirs loads language models from directories.

func NewRouterFromEmbedded

func NewRouterFromEmbedded() (*Router, error)

NewRouterFromEmbedded loads router from embedded multilingual bundle assets.

func NewRouterFromEmbeddedIn

func NewRouterFromEmbeddedIn(cacheDir string) (*Router, error)

NewRouterFromEmbeddedIn loads router from embedded assets and extracts files under cacheDir. If cacheDir is empty, user cache directory (or system temp) is used.

func NewRouterWithOptions

func NewRouterWithOptions(defaultLang string, options RouterOptions) *Router

NewRouterWithOptions creates an empty router with options.

func (*Router) Languages

func (r *Router) Languages() []string

Languages returns loaded languages sorted alphabetically.

func (*Router) Load

func (r *Router) Load(language string, modelDir string) error

Load loads one language model directory into router.

func (*Router) Meta added in v0.3.0

func (r *Router) Meta() ModelMeta

Meta returns the default engine's model metadata, or empty if no engine is loaded.

func (*Router) Predict

func (r *Router) Predict(ctx context.Context, text string, opts PredictOptions) (Prediction, error)

Predict routes by language hint/detection and returns prediction.

type RouterOptions

type RouterOptions struct {
	AutoDetectMinConfidence     float64
	ShortTextRuneLimit          int
	EnableCrossLanguageFallback bool
}

RouterOptions controls auto language routing behavior.

func DefaultRouterOptions

func DefaultRouterOptions() RouterOptions

DefaultRouterOptions returns default router behavior.

type Sample

type Sample struct {
	Text   string
	Intent string
}

Sample is one supervised training sample.

func LoadSamplesCSV

func LoadSamplesCSV(path string) ([]Sample, error)

LoadSamplesCSV loads labeled samples from CSV. Expected columns: text,intent (header row optional).

func LoadSamplesCSVWithWarnings added in v0.3.0

func LoadSamplesCSVWithWarnings(path string) ([]Sample, []string, error)

LoadSamplesCSVWithWarnings loads samples and reports conflicting duplicates (same text mapped to different intents).
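
Conflicting duplicates are easy to detect with a map from text to the set of intents seen for it. The sketch below shows one way to do that. The `sample` type and `conflicts` function are hypothetical, and the format of the warnings returned by `LoadSamplesCSVWithWarnings` is not specified here.

```go
package main

import (
	"fmt"
	"sort"
)

// sample mirrors the text/intent pair of a training row.
type sample struct{ Text, Intent string }

// conflicts returns texts that appear with more than one intent,
// sorted for stable output.
func conflicts(samples []sample) []string {
	seen := map[string]map[string]bool{}
	for _, s := range samples {
		if seen[s.Text] == nil {
			seen[s.Text] = map[string]bool{}
		}
		seen[s.Text][s.Intent] = true
	}
	var out []string
	for text, intents := range seen {
		if len(intents) > 1 {
			out = append(out, text)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	rows := []sample{
		{"hi", "chitchat_greeting"},
		{"hi", "chitchat_greetings"},
		{"will it rain today", "weather_info"},
		{"will it rain today", "weather_info"}, // exact duplicate, not a conflict
	}
	fmt.Println(conflicts(rows))
}
```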

type SourceMetadata

type SourceMetadata struct {
	Name     string            `json:"name,omitempty"`
	Version  string            `json:"version,omitempty"`
	Revision string            `json:"revision,omitempty"`
	RepoURL  string            `json:"repoUrl,omitempty"`
	Commit   string            `json:"commit,omitempty"`
	Extra    map[string]string `json:"extra,omitempty"`
}

SourceMetadata describes training data source for reproducibility.

type TaxonomyConfig

type TaxonomyConfig struct {
	Enabled bool              `json:"enabled"`
	Aliases map[string]string `json:"aliases,omitempty"`
}

TaxonomyConfig controls intent canonicalization.

func DefaultTaxonomyConfig

func DefaultTaxonomyConfig() TaxonomyConfig

DefaultTaxonomyConfig returns default taxonomy config.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer wraps language-specific tokenization with normalization and filtering.

func NewTokenizer

func NewTokenizer(cfg TokenizerConfig) (*Tokenizer, error)

NewTokenizer creates a tokenizer from config.

func (*Tokenizer) Config

func (t *Tokenizer) Config() TokenizerConfig

Config returns a copy of tokenizer config.

func (*Tokenizer) Normalize

func (t *Tokenizer) Normalize(text string) string

Normalize exposes text normalization for testing and dataset preprocessing.

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(text string) []string

Tokenize tokenizes normalized text and filters invalid tokens.
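
For non-CJK text, the normalize-then-split pipeline can be approximated with the standard library alone. The sketch below lowercases, replaces punctuation with spaces, splits on whitespace, and filters stopwords and short tokens, loosely mirroring the Lowercase, StripPunct, CollapseSpace, Stopwords, and MinTokenLen options. The `tokenize` function is hypothetical and does not cover the gse-based Chinese path.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the text, treats punctuation and symbols as separators,
// splits on whitespace, and drops stopwords and tokens shorter than minLen runes.
func tokenize(text string, stop map[string]bool, minLen int) []string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return unicode.ToLower(r)
	}, text)
	var tokens []string
	// strings.Fields also collapses runs of whitespace.
	for _, tok := range strings.Fields(cleaned) {
		if len([]rune(tok)) < minLen || stop[tok] {
			continue
		}
		tokens = append(tokens, tok)
	}
	return tokens
}

func main() {
	stop := map[string]bool{"the": true, "a": true}
	fmt.Println(tokenize("Please, summarize the LOGS!", stop, 2))
}
```

Using the same normalization at training and inference time matters more than the exact rules, since a naive Bayes model only sees the resulting tokens.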

type TokenizerConfig

type TokenizerConfig struct {
	Language      string   `json:"language,omitempty"`
	SearchMode    bool     `json:"searchMode"`
	HMM           bool     `json:"hmm"`
	Lowercase     bool     `json:"lowercase"`
	Stopwords     []string `json:"stopwords,omitempty"`
	CustomDicts   []string `json:"customDicts,omitempty"`
	MinTokenLen   int      `json:"minTokenLen"`
	StripPunct    bool     `json:"stripPunct"`
	CollapseSpace bool     `json:"collapseSpace"`
}

TokenizerConfig defines tokenizer behavior.

func DefaultTokenizerConfig

func DefaultTokenizerConfig() TokenizerConfig

DefaultTokenizerConfig returns a practical default tokenizer config.

type TrainConfig

type TrainConfig struct {
	Version                 string             `json:"version"`
	UnknownIntent           string             `json:"unknownIntent"`
	DefaultThreshold        float64            `json:"defaultThreshold"`
	Thresholds              map[string]float64 `json:"thresholds,omitempty"`
	Tokenizer               TokenizerConfig    `json:"tokenizer"`
	Split                   DatasetSplitConfig `json:"split"`
	AutoCalibrateThresholds bool               `json:"autoCalibrateThresholds"`
	Taxonomy                TaxonomyConfig     `json:"taxonomy"`
	Source                  SourceMetadata     `json:"source,omitempty"`
}

TrainConfig controls training behavior.

func DefaultTrainConfig

func DefaultTrainConfig() TrainConfig

DefaultTrainConfig returns a practical default training config.

type TrainedModel

type TrainedModel struct {
	// contains filtered or unexported fields
}

TrainedModel is an in-memory trained artifact.

func Train

func Train(samples []Sample, cfg TrainConfig) (*TrainedModel, error)

Train trains a bayesian model from labeled samples and produces metadata.

func (*TrainedModel) Meta

func (m *TrainedModel) Meta() ModelMeta

Meta returns a copy of model metadata.

func (*TrainedModel) SaveDir

func (m *TrainedModel) SaveDir(dir string) error

SaveDir persists the trained model into a directory.

type TrainingConfigSnapshot

type TrainingConfigSnapshot struct {
	DefaultThreshold        float64            `json:"defaultThreshold"`
	Thresholds              map[string]float64 `json:"thresholds,omitempty"`
	Tokenizer               TokenizerConfig    `json:"tokenizer"`
	Split                   DatasetSplitConfig `json:"split"`
	AutoCalibrateThresholds bool               `json:"autoCalibrateThresholds"`
	TaxonomyEnabled         bool               `json:"taxonomyEnabled"`
}

TrainingConfigSnapshot stores effective config used for training.

type TrainingMetadata

type TrainingMetadata struct {
	Seed                 int64                        `json:"seed"`
	TotalSampleCount     int                          `json:"totalSampleCount"`
	TrainSampleCount     int                          `json:"trainSampleCount"`
	ValSampleCount       int                          `json:"valSampleCount"`
	TestSampleCount      int                          `json:"testSampleCount"`
	Calibrated           bool                         `json:"calibrated"`
	CalibratedThresholds map[string]float64           `json:"calibratedThresholds,omitempty"`
	DataSummary          map[string]IntentDataSummary `json:"dataSummary,omitempty"`
	Config               TrainingConfigSnapshot       `json:"config"`
}

TrainingMetadata stores reproducibility and data summary info.

Directories

Path Synopsis
cmd
dataset
