intentnlu

package module

v0.3.2 (latest) | Published: Jun 13, 2026 | License: MIT | Imports: 25 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/godeps/intent-nlu

README ¶

intent-nlu

Lightweight, embeddable intent classification engine for Go.

Module: github.com/godeps/intent-nlu
Chinese docs: README.zh-CN.md
Core stack:
- Tokenization: github.com/go-ego/gse (Chinese), normalized splitter for non-CJK
- Classifier: github.com/jbrukh/bayesian

What It Provides

Low-latency pre-LLM intent recognition (~1ms/request).
Skill routing for creative pipelines (video, image, audio, 3D, analysis).
Tool routing for operational intents (search, code, tasks, files, data, docs, etc.).
Deterministic train/val/test evaluation pipeline.
Per-intent threshold calibration (optional).
Intent taxonomy normalization (aliases -> canonical intents).
Multi-language routing (zh, en, extensible).
Hybrid policy (rules -> NLU -> fallback/LLM).
Data feedback loop for active dataset improvement.
Reproducibility metadata in model meta and bundle manifest.

Repository Layout

intent-nlu/
  cmd/
    intent-nlu-train/            # train one language model
    intent-nlu-predict/          # predict by single model / model map / bundle
    intent-nlu-bundle/           # build multilingual bundle from trained models
    intent-nlu-feedback/         # feedback ingestion and dataset/review update
  dataset/chatterbot/          # chatterbot corpus loader
  datasets/
    default/
      zh_business.csv          # business intents (calendar, weather, greeting)
      en_business.csv
      zh_skill_routing.csv     # skill routing intents (creative, analysis, chat)
      en_skill_routing.csv
      zh_tools_routing.csv     # tool routing intents (search, code, tasks, files, etc.)
      en_tools_routing.csv
      zh_tools_boost.csv       # supplemental short-phrase samples for tool intents
      en_tools_boost.csv
      zh_tools_boost2.csv      # targeted samples for weak intents (debug, analyze, search)
      en_tools_boost2.csv
    generated/
      *_train.csv              # effective training samples
      *_file_map.yaml          # auto-generated chatterbot mappings
      eval/*.json              # evaluation/model meta snapshots
    feedback/
      review/                  # low-confidence/unknown queue
      archive/                 # optional archived feedback
  docs/
    architecture.md
    skill-routing-integration.md
  examples/
    file_intent_map.yaml
  models/
    model-zh/
    model-en/
    multilingual/
  scripts/
    train_chatterbot_models.sh
    feedback_loop.sh

  # core package
  types.go
  tokenizer.go
  language.go
  taxonomy.go
  evaluation.go
  trainer.go
  model.go
  engine.go
  router.go
  router_bundle.go
  hybrid_policy.go
  embedded_bundle.go

Quick Start

1) Run tests

make test

2) Reproducible local evaluation (`make eval`)

make eval

Outputs:

datasets/generated/eval/zh_eval.json
datasets/generated/eval/en_eval.json

Both reports include split metrics (accuracy, macro-F1, micro-F1, confusion matrix, per-intent metrics) and training metadata.
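As a rough illustration of how macro-F1 and micro-F1 differ, the sketch below derives both from per-intent TP/FP/FN counts. The `counts` type and the sample numbers are made up for this example; it is not the library's evaluation code.

```go
package main

import "fmt"

// counts holds hypothetical per-intent confusion counts (illustrative only).
type counts struct{ TP, FP, FN int }

// f1 returns 2TP / (2TP + FP + FN), or 0 when the denominator is zero.
func f1(c counts) float64 {
	denom := 2*c.TP + c.FP + c.FN
	if denom == 0 {
		return 0
	}
	return float64(2*c.TP) / float64(denom)
}

// macroF1 averages per-intent F1, so rare intents weigh as much as common ones.
func macroF1(per map[string]counts) float64 {
	if len(per) == 0 {
		return 0
	}
	sum := 0.0
	for _, c := range per {
		sum += f1(c)
	}
	return sum / float64(len(per))
}

// microF1 pools the counts across intents before computing F1.
func microF1(per map[string]counts) float64 {
	var total counts
	for _, c := range per {
		total.TP += c.TP
		total.FP += c.FP
		total.FN += c.FN
	}
	return f1(total)
}

func main() {
	per := map[string]counts{
		"web_search":  {TP: 8, FP: 2, FN: 2},
		"creative_3d": {TP: 1, FP: 0, FN: 3},
	}
	fmt.Printf("macro=%.2f micro=%.2f\n", macroF1(per), microF1(per))
}
```

A weak low-volume intent pulls macro-F1 down more than micro-F1, which is why both are worth tracking when intents have uneven sample counts.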

3) One-click corpus training (`zh,en` default)

make train

Outputs:

models/model-zh, models/model-en
models/multilingual
datasets/generated/{zh,en}_train.csv
datasets/generated/eval/{zh,en}_meta.json

Intent Classes

Default embedded models include business, skill routing, tool routing, and chitchat intents:

zh: 20 canonical classes
en: 20 canonical classes
fallback: unknown

Skill Routing Intents

Intent	Description	ZH Samples	EN Samples
`creative_video`	Video production, editing, TVC, ads	200	200
`creative_image`	Image generation, poster, illustration	200	203
`creative_audio`	Music, sound effects, audio production	100	100
`creative_3d`	3D modeling, rendering	80	80
`media_analysis`	Video/image understanding, description	100	100
`general_chat`	Chitchat, questions, non-creative tasks	300	300

Tool Routing Intents

Maps user intents to Saker built-in tools and operational skills:

Intent	Description	Saker Tools	ZH F1	EN F1
`web_search`	Search the web, look up docs, find info	web_search, web_fetch, browser	0.57	0.61
`coding_assist`	Write/debug/run code, fix bugs, tests	bash, edit, read, write, grep, glob	0.69	0.72
`task_management`	Create/list/update tasks, kanban board	task_, kanban_	0.90	0.93
`file_operation`	Download, read, write, save files	fetch_file, read, write	0.88	1.00
`knowledge_qa`	Recall past decisions, remember context	memory_read, recall_context	0.96	0.88
`workflow_automation`	Schedule cron jobs, automate pipelines	workflow, cron, loop	0.53	0.67
`data_analysis`	Analyze metrics, compute stats, find patterns	bash, canvas_table_write	0.78	0.88
`document_creation`	Write docs, README, guides, reports	canvas_create_node, write	0.71	0.89
`translation`	Translate text between languages	— (common user need)	0.86	0.93
`summarization`	Summarize logs, reports, discussions	video_summarizer	0.67	0.80
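One way a caller might turn predicted intents into a tool shortlist is a plain lookup table. The sketch below copies a few rows from the table above; the `toolsByIntent` map and `shortlist` helper are hypothetical names, not part of the package.

```go
package main

import "fmt"

// toolsByIntent mirrors a few rows of the tool routing table (illustrative subset).
var toolsByIntent = map[string][]string{
	"web_search":        {"web_search", "web_fetch", "browser"},
	"file_operation":    {"fetch_file", "read", "write"},
	"document_creation": {"canvas_create_node", "write"},
	"summarization":     {"video_summarizer"},
}

// shortlist returns the union of tools for the given intents in first-seen order.
func shortlist(intents []string) []string {
	seen := map[string]bool{}
	var tools []string
	for _, intent := range intents {
		for _, tool := range toolsByIntent[intent] {
			if !seen[tool] {
				seen[tool] = true
				tools = append(tools, tool)
			}
		}
	}
	return tools
}

func main() {
	fmt.Println(shortlist([]string{"file_operation", "document_creation"}))
}
```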

Business Intents

Intent	Description	zh	en
`calendar_info`	Date, weekday, holiday, lunar/solar calendar queries	Yes	Yes
`weather_info`	Weather forecast, rain, temperature	Yes	Yes
`chitchat_greeting`	Direct short greetings (hi/hello/good morning)	Yes	Yes

Chitchat Intents

Intent	Description	zh	en
`chitchat_greetings`	Greeting variants from chatterbot corpus	Yes	Yes
`chitchat_ai`	AI/assistant topic small talk	Yes	Yes
`chitchat_botprofile`	Bot identity/capabilities/preferences	Yes	Yes
`chitchat_conversations`	Generic open-domain conversation	Yes	Yes
`chitchat_emotion`	Emotion/mood/support style casual talk	Yes	Yes
`chitchat_food`	Food and drink discussion	Yes	Yes
`chitchat_gossip`	Gossip/celebrity/light social topics	Yes	Yes
`chitchat_history`	History trivia chat	Yes	Yes
`chitchat_humor`	Jokes/funny content	Yes	Yes
`chitchat_literature`	Books/writing/literature chat	Yes	Yes
`chitchat_money`	Money/finance-light casual talk	Yes	Yes
`chitchat_movies`	Movies/entertainment chat	Yes	Yes
`chitchat_politics`	Politics social chat	Yes	Yes
`chitchat_psychology`	Psychology/personality casual topics	Yes	Yes
`chitchat_science`	Science trivia chat	Yes	Yes
`chitchat_sports`	Sports chat	Yes	Yes
`chitchat_trivia`	General trivia/knowledge snippets	Yes	Yes
`chitchat_coding`	Programming/dev casual topics	No	Yes
`chitchat_computers`	Computer/device/software casual topics	No	Yes
`chitchat_health`	Health/wellness casual topics	No	Yes
`chitchat_tech_support`	Light technical support style chat	No	Yes

Taxonomy Aliases

Applied at inference time via NormalizeIntent():

// Creative routing
"video_production"  -> "creative_video"
"video_editing"     -> "creative_video"
"film_production"   -> "creative_video"
"ad_production"     -> "creative_video"
"image_generation"  -> "creative_image"
"poster_design"     -> "creative_image"
"music_creation"    -> "creative_audio"
"audio_production"  -> "creative_audio"
"3d_modeling"       -> "creative_3d"
"video_analysis"    -> "media_analysis"
"image_analysis"    -> "media_analysis"

// Tool routing
"search"            -> "web_search"
"internet_search"   -> "web_search"
"lookup"            -> "web_search"
"google"            -> "web_search"
"code"              -> "coding_assist"
"programming"       -> "coding_assist"
"debug"             -> "coding_assist"
"fix_code"          -> "coding_assist"
"write_code"        -> "coding_assist"
"task"              -> "task_management"
"todo"              -> "task_management"
"kanban"            -> "task_management"
"download"          -> "file_operation"
"upload"            -> "file_operation"
"read_file"         -> "file_operation"
"save_file"         -> "file_operation"
"recall"            -> "knowledge_qa"
"remember"          -> "knowledge_qa"
"schedule"          -> "workflow_automation"
"cron"              -> "workflow_automation"
"automate"          -> "workflow_automation"
"analyze_data"      -> "data_analysis"
"statistics"        -> "data_analysis"
"metrics"           -> "data_analysis"
"create_doc"        -> "document_creation"
"write_doc"         -> "document_creation"
"documentation"     -> "document_creation"
"translate"         -> "translation"
"localize"          -> "translation"
"summarize"         -> "summarization"
"tldr"              -> "summarization"
"digest"            -> "summarization"

Supported Languages

Language	Code	Default Embedded Model	Auto Detect	Notes
Chinese	`zh`	Yes	Yes	`gse` tokenizer, 20 canonical classes
English	`en`	Yes	Yes	normalized tokenizer, 20 canonical classes
Japanese	`ja`	No (train yourself)	Yes	language detection supported
Korean	`ko`	No (train yourself)	Yes	language detection supported

Routing Modes and Thresholds

Mode	Suggested Settings	Behavior
Candidate handoff to LLM/tool planner	`CandidateMode: true`, `TopK: 3-5`	Prioritizes recall; possible tools remain visible for final LLM selection
Direct tool execution	`0.75 - 0.85` threshold	Prioritizes precision; only high-confidence intents execute without LLM review
Business routing (balanced)	`0.60 - 0.70` threshold	Good default for deterministic intent dispatch
Default baseline	`0.55` threshold	Current training default if no per-intent override
Uncertain direct route	`< threshold` => `unknown`	Keep `Candidates`; route to fallback or LLM
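The threshold behavior in the table can be pictured with the minimal sketch below. It assumes candidates arrive sorted by descending score and that a per-intent threshold overrides the default; the `candidate` type and `route` function are illustrative stand-ins, not the package's internals.

```go
package main

import "fmt"

// candidate is a hypothetical ranked hypothesis (intent plus score).
type candidate struct {
	Intent string
	Score  float64
}

// thresholdFor returns the per-intent override when present, else the default.
func thresholdFor(intent string, perIntent map[string]float64, def float64) float64 {
	if t, ok := perIntent[intent]; ok {
		return t
	}
	return def
}

// route picks the final intent. In candidate mode the top intent is accepted
// regardless of threshold; otherwise a low score yields "unknown".
func route(cands []candidate, perIntent map[string]float64, def float64, candidateMode bool) (string, bool) {
	if len(cands) == 0 {
		return "unknown", false
	}
	top := cands[0]
	if candidateMode || top.Score >= thresholdFor(top.Intent, perIntent, def) {
		return top.Intent, true
	}
	return "unknown", false
}

func main() {
	cands := []candidate{{"web_search", 0.50}, {"summarization", 0.30}}
	strict := map[string]float64{"web_search": 0.80}

	intent, ok := route(cands, strict, 0.55, false)
	fmt.Println(intent, ok) // direct routing rejects the low score

	intent, ok = route(cands, strict, 0.55, true)
	fmt.Println(intent, ok) // candidate mode keeps the top intent
}
```

In either case the caller still holds the full candidate list, so a downstream LLM or tool planner can make the final selection.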

Operational Notes

Topic	Risk	Recommendation
Fine-grained classes	Confusion across similar chitchat intents	Keep enough per-intent samples and evaluate confusion matrix
Corpus bias	chatterbot data is mostly chitchat	Always mix business/skill routing CSV for production tasks
Multilingual routing	Short/mixed text can route wrong language	Use language hint for critical paths
Threshold drift	Retraining changes confidence distribution	Re-calibrate thresholds and compare `eval/*.json` every release
Tool candidate loss	High thresholds can hide plausible tools from LLM planners	Use `CandidateMode` and monitor TopK candidate recall
Embedded bundle updates	New models require dependency rebuild	Pin model version and release notes with each update

Training Workflows

A) One-click script (recommended)

./scripts/train_chatterbot_models.sh \
  --langs zh,en \
  --threshold 0.55 \
  --split-enabled true \
  --train-ratio 0.8 \
  --val-ratio 0.1 \
  --test-ratio 0.1 \
  --seed 42 \
  --auto-calibrate true \
  --merge-bundle true \
  --bundle-dir ./models/multilingual

What it does:

Clone/update chatterbot-corpus.
Auto-generate file->intent mapping (chitchat_<file>).
Auto-discover and merge all datasets/default/<lang>_*.csv files (business, skill routing, tool routing, boost data).
Train models with split/evaluation/calibration.
Build multilingual bundle via cmd/intent-nlu-bundle.

B) Manual training CLI

GOWORK=off go run ./cmd/intent-nlu-train \
  -lang zh \
  -corpus-root /path/to/chatterbot_corpus/data/chinese \
  -file-map ./examples/file_intent_map.yaml \
  -extra-csv ./datasets/default/zh_business.csv,./datasets/default/zh_skill_routing.csv \
  -dump-samples ./datasets/generated/zh_train.csv \
  -eval-report ./datasets/generated/eval/zh_meta.json \
  -out ./models/model-zh \
  -version 2026.05.31.zh.1 \
  -threshold 0.55 \
  -split-enabled=true \
  -train-ratio 0.8 \
  -val-ratio 0.1 \
  -test-ratio 0.1 \
  -seed 42 \
  -auto-calibrate-thresholds=true

Important flags:

Data: -corpus-root, -file-map, -category-map, -extra-csv (comma-separated multi-file)
Split/eval: -split-enabled, -train-ratio, -val-ratio, -test-ratio, -seed, -eval-report
Threshold: -threshold, -thresholds, -auto-calibrate-thresholds
Taxonomy: -disable-taxonomy (default true), -taxonomy-aliases
Reproducibility source: -source-name, -source-version, -source-revision, -source-repo-url, -source-commit

Bundle Build CLI

GOWORK=off go run ./cmd/intent-nlu-bundle \
  -bundle-dir ./models/multilingual \
  -models "zh=./models/model-zh,en=./models/model-en" \
  -default-lang zh \
  -version 2026.05.31.bundle.1 \
  -corpus-repo-url https://github.com/gunthercox/chatterbot-corpus.git \
  -corpus-commit <commit> \
  -training-params "seed=42,train_ratio=0.8,val_ratio=0.1,test_ratio=0.1"

Prediction

Single model

GOWORK=off go run ./cmd/intent-nlu-predict \
  -model ./models/model-zh \
  -text "帮我做个产品视频" \
  -lang auto \
  -topk 3

Recall-first candidate output for LLM/tool-planner selection:

GOWORK=off go run ./cmd/intent-nlu-predict \
  -bundle ./models/multilingual \
  -text "maybe look this up and summarize it" \
  -lang auto \
  -topk 5 \
  -candidate-mode

Multi-model map

GOWORK=off go run ./cmd/intent-nlu-predict \
  -models "zh=./models/model-zh,en=./models/model-en" \
  -text "create a poster" \
  -lang auto
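The `-models` value is a comma-separated list of `lang=dir` pairs. A caller that wants to feed the same string into `NewRouterFromDirs` could parse it roughly as follows; `parseModels` is a hypothetical helper and the CLI's own parsing may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// parseModels turns "zh=./a,en=./b" into a language -> directory map.
func parseModels(spec string) (map[string]string, error) {
	out := map[string]string{}
	for _, pair := range strings.Split(spec, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate trailing commas
		}
		lang, dir, ok := strings.Cut(pair, "=")
		if !ok || lang == "" || dir == "" {
			return nil, fmt.Errorf("invalid entry %q, want lang=dir", pair)
		}
		out[lang] = dir
	}
	return out, nil
}

func main() {
	models, err := parseModels("zh=./models/model-zh,en=./models/model-en")
	if err != nil {
		panic(err)
	}
	fmt.Println(models["zh"], models["en"])
}
```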

Bundle

GOWORK=off go run ./cmd/intent-nlu-predict \
  -bundle ./models/multilingual \
  -text "做一个3D模型" \
  -lang auto

No model flags (use embedded default bundle)

GOWORK=off go run ./cmd/intent-nlu-predict \
  -text "analyze this video" \
  -lang auto

If -bundle, -models, and -model are all omitted, the command loads embedded default models.

Embedded Bundle (for dependency consumers)

intent-nlu embeds default multilingual models into the package. When another Go service imports this module, it can load models without shipping external files.

import intentnlu "github.com/godeps/intent-nlu"

router, err := intentnlu.NewRouterFromEmbedded()
if err != nil {
    panic(err)
}

pred, err := router.Predict(context.Background(), "帮我画一张海报", intentnlu.PredictOptions{
    TopK:         3,
    LanguageHint: "zh",
})
if err != nil {
    panic(err)
}
// Expected: pred.Intent == "creative_image". pred.Confidence is around 0.96
// and can shift when the model is retrained.

Optional custom extraction cache directory:

router, err := intentnlu.NewRouterFromEmbeddedIn("./.cache/intent-nlu")

Feedback Loop

Use a model feedback CSV to:

Append human-labeled rows to datasets/default/<lang>_business.csv
Put low-confidence/unknown rows into the review queue

./scripts/feedback_loop.sh --input ./tmp/feedback.csv

Supported CSV headers:

required: text
optional aliases:
- language: language
- predicted intent: pred_intent|intent|predicted_intent
- score: confidence|score
- human label: final_intent|human_intent|label

Package Usage

Engine

engine, err := intentnlu.NewEngineFromDir("./models/model-zh")
if err != nil {
    panic(err)
}

pred, err := engine.Predict(context.Background(), "做一段背景音乐", intentnlu.PredictOptions{
    TopK:         3,
    LanguageHint: "auto",
})
if err != nil {
    panic(err)
}
// Expected: pred.Intent == "creative_audio"

Router

router, err := intentnlu.NewRouterFromBundle("./models/multilingual")
if err != nil {
    panic(err)
}

pred, err := router.Predict(context.Background(), "create a 3D model", intentnlu.PredictOptions{
    TopK:         3,
    LanguageHint: "auto",
})
if err != nil {
    panic(err)
}
// Expected: pred.Intent == "creative_3d"

For LLM/tool-planner handoff, prefer recall-first candidates:

pred, err := router.Predict(context.Background(), userText, intentnlu.PredictOptions{
    TopK:          5,
    LanguageHint:  "auto",
    CandidateMode: true,
})
// pred.Candidates contains the ranked possible tools/intents even when a
// direct-routing threshold would have rejected the top intent.

Hybrid Policy (rules -> NLU -> candidate/fallback)

policy := &intentnlu.HybridPolicy{
    Router: router,
    Rules: []intentnlu.DeterministicRule{
        {ID: "r1", Intent: "video_production", ContainsAny: []string{"tvc", "宣传片制作"}},
    },
}
// Prepare validates the rules and compiles their regexes. Taxonomy
// normalizes "video_production" -> "creative_video".
if err := policy.Prepare(); err != nil {
    panic(err)
}

decision, err := policy.Decide(context.Background(), userText, intentnlu.PredictOptions{
    TopK:          5,
    CandidateMode: true,
})
// decision.Route: rule | nlu | candidate | fallback
// decision.ShouldCallLLM tells whether to continue into LLM

Notes

chatterbot-corpus is mostly chitchat; business and skill routing intents need curated data.
Multilingual bundle is a packaging format, not one fused multilingual classifier.
For LLM final selection, optimize TopK candidate recall before single-label precision.
Generated artifacts can grow quickly; plan storage strategy by environment.

Commands Summary

make test                                  # run all tests
make eval                                  # reproducible evaluation (CSV only)
make train                                 # full training + bundling
./scripts/feedback_loop.sh --input <csv>   # feedback data ingestion

Documentation ¶

Index ¶

Constants
func DefaultIntentAliases() map[string]string
func ExtractEmbeddedBundle(cacheDir string) (string, error)
func NormalizeIntent(intent string, aliases map[string]string) string
func NormalizeThresholds(thresholds map[string]float64, aliases map[string]string) map[string]float64
func SaveBundleManifest(bundleDir string, manifest BundleManifest) error
func SaveSamplesCSV(path string, samples []Sample) error
type BundleManifest
- func EmbeddedBundleManifest() (BundleManifest, error)
- func LoadBundleManifest(bundleDir string) (BundleManifest, error)
type BundleModelSummary
type Candidate
type ClassMetrics
type DatasetSplitConfig
- func DefaultDatasetSplitConfig() DatasetSplitConfig
type DeterministicRule
type Engine
- func NewEngineFromDir(modelDir string) (*Engine, error)
- func (e *Engine) Language() string
- func (e *Engine) Meta() ModelMeta
- func (e *Engine) Predict(_ context.Context, text string, opts PredictOptions) (Prediction, error)
- func (e *Engine) Reload(modelDir string) error
- func (e *Engine) Version() string
type EvalReport
type HybridDecision
type HybridPolicy
- func (p *HybridPolicy) Decide(ctx context.Context, text string, opts PredictOptions) (HybridDecision, error)
- func (p *HybridPolicy) Prepare() error
type IntentDataSummary
type Language
- func DetectLanguage(text string) Language
type LanguageDetection
- func DetectLanguageDetailed(text string) LanguageDetection
type ModelMeta
type PredictOptions
type Prediction
type Router
- func NewRouter(defaultLang string) *Router
- func NewRouterFromBundle(bundleDir string) (*Router, error)
- func NewRouterFromDirs(modelByLanguage map[string]string, defaultLang string) (*Router, error)
- func NewRouterFromEmbedded() (*Router, error)
- func NewRouterFromEmbeddedIn(cacheDir string) (*Router, error)
- func NewRouterWithOptions(defaultLang string, options RouterOptions) *Router
- func (r *Router) Languages() []string
- func (r *Router) Load(language string, modelDir string) error
- func (r *Router) Meta() ModelMeta
- func (r *Router) Predict(ctx context.Context, text string, opts PredictOptions) (Prediction, error)
type RouterOptions
- func DefaultRouterOptions() RouterOptions
type Sample
- func LoadSamplesCSV(path string) ([]Sample, error)
- func LoadSamplesCSVWithWarnings(path string) ([]Sample, []string, error)
type SourceMetadata
type TaxonomyConfig
- func DefaultTaxonomyConfig() TaxonomyConfig
type Tokenizer
- func NewTokenizer(cfg TokenizerConfig) (*Tokenizer, error)
- func (t *Tokenizer) Config() TokenizerConfig
- func (t *Tokenizer) Normalize(text string) string
- func (t *Tokenizer) Tokenize(text string) []string
type TokenizerConfig
- func DefaultTokenizerConfig() TokenizerConfig
type TrainConfig
- func DefaultTrainConfig() TrainConfig
type TrainedModel
- func Train(samples []Sample, cfg TrainConfig) (*TrainedModel, error)
- func (m *TrainedModel) Meta() ModelMeta
- func (m *TrainedModel) SaveDir(dir string) error
type TrainingConfigSnapshot
type TrainingMetadata

Constants ¶

const (
	HybridRouteRule      = "rule"
	HybridRouteNLU       = "nlu"
	HybridRouteCandidate = "candidate"
	HybridRouteFallback  = "fallback"
)

const (
	// DefaultUnknownIntent is returned when confidence is below threshold.
	DefaultUnknownIntent = "unknown"
	// ModelBinaryFile is the bayesian model artifact file name.
	ModelBinaryFile = "model.gob"
	// ModelMetaFile is the metadata artifact file name.
	ModelMetaFile = "meta.json"
)

const (
	// BundleManifestFileName is the default file name of multilingual router bundle manifest.
	BundleManifestFileName = "manifest.json"
)

Variables ¶

This section is empty.

Functions ¶

func DefaultIntentAliases ¶

func DefaultIntentAliases() map[string]string

DefaultIntentAliases returns a copy of the stable intent taxonomy aliases.

func ExtractEmbeddedBundle ¶

func ExtractEmbeddedBundle(cacheDir string) (string, error)

ExtractEmbeddedBundle extracts the embedded multilingual bundle to the local filesystem and returns the bundle directory. Extraction is idempotent for the same cacheDir and embedded manifest hash.

func NormalizeIntent ¶

func NormalizeIntent(intent string, aliases map[string]string) string

NormalizeIntent normalizes one intent to canonical taxonomy label.
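Conceptually, alias normalization is a map lookup that falls back to the input label. The stand-in below assumes labels are trimmed and lower-cased before the lookup, which the real function may or may not do; `normalizeIntent` here is a hypothetical re-implementation, not the package's code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeIntent maps an alias to its canonical label. Labels with no alias
// entry are returned unchanged apart from trimming and lower-casing.
func normalizeIntent(intent string, aliases map[string]string) string {
	key := strings.ToLower(strings.TrimSpace(intent))
	if canonical, ok := aliases[key]; ok {
		return canonical
	}
	return key
}

func main() {
	aliases := map[string]string{
		"video_production": "creative_video",
		"tldr":             "summarization",
	}
	fmt.Println(normalizeIntent("video_production", aliases))
	fmt.Println(normalizeIntent("weather_info", aliases))
}
```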

func NormalizeThresholds ¶

func NormalizeThresholds(thresholds map[string]float64, aliases map[string]string) map[string]float64

NormalizeThresholds canonicalizes threshold keys with taxonomy aliases.

func SaveBundleManifest ¶

func SaveBundleManifest(bundleDir string, manifest BundleManifest) error

SaveBundleManifest writes bundle manifest into bundle directory.

func SaveSamplesCSV ¶

func SaveSamplesCSV(path string, samples []Sample) error

SaveSamplesCSV writes labeled samples into CSV with header text,intent.

Types ¶

type BundleManifest ¶

type BundleManifest struct {
	Version         string                        `json:"version"`
	CreatedAt       time.Time                     `json:"createdAt"`
	DefaultLanguage string                        `json:"defaultLanguage"`
	Corpus          SourceMetadata                `json:"corpus,omitempty"`
	TrainingParams  map[string]string             `json:"trainingParams,omitempty"`
	ModelSummary    map[string]BundleModelSummary `json:"modelSummary,omitempty"`
	Models          map[string]string             `json:"models"` // lang -> relative model directory
}

BundleManifest describes one multilingual model bundle.

func EmbeddedBundleManifest ¶

func EmbeddedBundleManifest() (BundleManifest, error)

EmbeddedBundleManifest reads bundle manifest from embedded assets.

func LoadBundleManifest ¶

func LoadBundleManifest(bundleDir string) (BundleManifest, error)

LoadBundleManifest reads manifest.json from a bundle directory.

type BundleModelSummary ¶

type BundleModelSummary struct {
	Version             string  `json:"version,omitempty"`
	Language            string  `json:"language,omitempty"`
	TrainingSampleCount int     `json:"trainingSampleCount,omitempty"`
	TotalSampleCount    int     `json:"totalSampleCount,omitempty"`
	DefaultThreshold    float64 `json:"defaultThreshold,omitempty"`
	MacroF1             float64 `json:"macroF1,omitempty"`
}

BundleModelSummary stores high-level model metadata in bundle manifest.

type Candidate ¶

type Candidate struct {
	Intent string  `json:"intent"`
	Score  float64 `json:"score"`
}

Candidate is one ranked prediction candidate.

type ClassMetrics ¶

type ClassMetrics struct {
	Precision       float64 `json:"precision"`
	Recall          float64 `json:"recall"`
	F1              float64 `json:"f1"`
	Top1Recall      float64 `json:"top1Recall,omitempty"`
	Top3Recall      float64 `json:"top3Recall,omitempty"`
	Top5Recall      float64 `json:"top5Recall,omitempty"`
	Support         int     `json:"support"`
	TP              int     `json:"tp"`
	FP              int     `json:"fp"`
	FN              int     `json:"fn"`
	Top1CandidateTP int     `json:"top1CandidateTp,omitempty"`
	Top3CandidateTP int     `json:"top3CandidateTp,omitempty"`
	Top5CandidateTP int     `json:"top5CandidateTp,omitempty"`
}

ClassMetrics describes one intent evaluation result.

type DatasetSplitConfig ¶

type DatasetSplitConfig struct {
	Enabled    bool    `json:"enabled"`
	TrainRatio float64 `json:"trainRatio"`
	ValRatio   float64 `json:"valRatio"`
	TestRatio  float64 `json:"testRatio"`
	Seed       int64   `json:"seed"`
}

DatasetSplitConfig controls deterministic train/val/test split.
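A deterministic split usually means shuffling with a fixed seed and cutting by ratio. The sketch below shows that idea over sample indices; it does not stratify per intent, and whether the library does is not stated here. The `split` function is illustrative only.

```go
package main

import (
	"fmt"
	"math/rand"
)

// split shuffles indices 0..n-1 with a seeded generator and cuts them by
// ratio. The test share is whatever remains after train and val.
func split(n int, trainRatio, valRatio float64, seed int64) (train, val, test []int) {
	idx := rand.New(rand.NewSource(seed)).Perm(n)
	nTrain := int(float64(n) * trainRatio)
	nVal := int(float64(n) * valRatio)
	return idx[:nTrain], idx[nTrain : nTrain+nVal], idx[nTrain+nVal:]
}

func main() {
	train, val, test := split(100, 0.8, 0.1, 42)
	fmt.Println(len(train), len(val), len(test))

	// The same seed reproduces the same assignment.
	again, _, _ := split(100, 0.8, 0.1, 42)
	fmt.Println(train[0] == again[0])
}
```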

func DefaultDatasetSplitConfig ¶

func DefaultDatasetSplitConfig() DatasetSplitConfig

DefaultDatasetSplitConfig returns default split config.

type DeterministicRule ¶

type DeterministicRule struct {
	ID          string   `json:"id"`
	Intent      string   `json:"intent"`
	Language    string   `json:"language,omitempty"`
	EqualsAny   []string `json:"equalsAny,omitempty"`
	PrefixAny   []string `json:"prefixAny,omitempty"`
	ContainsAny []string `json:"containsAny,omitempty"`
	Regex       string   `json:"regex,omitempty"`
	// contains filtered or unexported fields
}

DeterministicRule defines one pre-NLU deterministic route rule.

type Engine ¶

type Engine struct {
	// contains filtered or unexported fields
}

Engine provides concurrent-safe prediction and hot reload.

func NewEngineFromDir ¶

func NewEngineFromDir(modelDir string) (*Engine, error)

NewEngineFromDir loads model artifacts and creates a prediction engine.

func (*Engine) Language ¶

func (e *Engine) Language() string

Language returns current model language.

func (*Engine) Meta ¶ added in v0.3.0

func (e *Engine) Meta() ModelMeta

Meta returns a deep copy of the current model metadata.

func (*Engine) Predict ¶

func (e *Engine) Predict(_ context.Context, text string, opts PredictOptions) (Prediction, error)

Predict runs intent prediction.

func (*Engine) Reload ¶

func (e *Engine) Reload(modelDir string) error

Reload atomically reloads model artifacts from the given directory.

func (*Engine) Version ¶

func (e *Engine) Version() string

Version returns current model version.

type EvalReport ¶

type EvalReport struct {
	Split       string                    `json:"split"`
	Samples     int                       `json:"samples"`
	Accuracy    float64                   `json:"accuracy"`
	MacroF1     float64                   `json:"macroF1"`
	MicroF1     float64                   `json:"microF1"`
	UnknownRate float64                   `json:"unknownRate"`
	PerIntent   map[string]ClassMetrics   `json:"perIntent"`
	Confusion   map[string]map[string]int `json:"confusion"`
}

EvalReport describes evaluation metrics for one split.

type HybridDecision ¶

type HybridDecision struct {
	Route         string     `json:"route"`
	Intent        string     `json:"intent,omitempty"`
	RuleID        string     `json:"ruleId,omitempty"`
	Prediction    Prediction `json:"prediction"`
	ShouldCallLLM bool       `json:"shouldCallLLM"`
}

HybridDecision describes final routing decision.

type HybridPolicy ¶

type HybridPolicy struct {
	Rules         []DeterministicRule
	Router        *Router
	Engine        *Engine
	UnknownIntent string
}

HybridPolicy combines deterministic rules + NLU + fallback. With PredictOptions.CandidateMode, NLU is used as a high-recall candidate generator and the final decision is left to the downstream LLM/tool planner.

func (*HybridPolicy) Decide ¶

func (p *HybridPolicy) Decide(ctx context.Context, text string, opts PredictOptions) (HybridDecision, error)

Decide applies deterministic rules first, then NLU, then fallback.

func (*HybridPolicy) Prepare ¶

func (p *HybridPolicy) Prepare() error

Prepare validates and compiles regex for hybrid rules.

type IntentDataSummary ¶

type IntentDataSummary struct {
	Total int `json:"total"`
	Train int `json:"train"`
	Val   int `json:"val"`
	Test  int `json:"test"`
}

IntentDataSummary stores per-intent sample counts by split.

type Language ¶

type Language string

Language identifies tokenizer/model language.

const (
	LanguageAuto Language = "auto"
	LanguageZH   Language = "zh"
	LanguageEN   Language = "en"
	LanguageJA   Language = "ja"
	LanguageKO   Language = "ko"
)

func DetectLanguage ¶

func DetectLanguage(text string) Language

DetectLanguage returns a lightweight language guess for routing.

type LanguageDetection ¶

type LanguageDetection struct {
	Language    Language
	Confidence  float64
	Reason      string
	LetterCount int
	ShortText   bool
}

LanguageDetection stores language detection result.

func DetectLanguageDetailed ¶

func DetectLanguageDetailed(text string) LanguageDetection

DetectLanguageDetailed returns language detection with confidence and reason.
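A lightweight language guess can be made from Unicode scripts alone. The sketch below counts script membership with the standard `unicode` tables; the decision order and the `detect` function are assumptions for illustration and are not the heuristics this package uses.

```go
package main

import (
	"fmt"
	"unicode"
)

// detect guesses a language code from the scripts present in text.
// Kana implies Japanese and Hangul implies Korean; otherwise Han vs Latin
// counts decide between zh and en. "auto" means no usable signal.
func detect(text string) string {
	var han, kana, hangul, latin int
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Hiragana, r), unicode.Is(unicode.Katakana, r):
			kana++
		case unicode.Is(unicode.Hangul, r):
			hangul++
		case unicode.Is(unicode.Han, r):
			han++
		case unicode.Is(unicode.Latin, r):
			latin++
		}
	}
	switch {
	case kana > 0:
		return "ja"
	case hangul > 0:
		return "ko"
	case han > 0 && han >= latin:
		return "zh"
	case latin > 0:
		return "en"
	default:
		return "auto"
	}
}

func main() {
	for _, text := range []string{"做一个3D模型", "create a poster", "こんにちは", "안녕하세요", "123"} {
		fmt.Printf("%q -> %s\n", text, detect(text))
	}
}
```

Short or mixed-script input is where a heuristic like this is least reliable, which matches the advice above to pass a language hint on critical paths.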

type ModelMeta ¶

type ModelMeta struct {
	Version             string                `json:"version"`
	Language            string                `json:"language,omitempty"`
	UnknownIntent       string                `json:"unknownIntent"`
	DefaultThreshold    float64               `json:"defaultThreshold"`
	Thresholds          map[string]float64    `json:"thresholds,omitempty"`
	Classes             []string              `json:"classes"`
	CanonicalIntents    []string              `json:"canonicalIntents,omitempty"`
	IntentAliases       map[string]string     `json:"intentAliases,omitempty"`
	Tokenizer           TokenizerConfig       `json:"tokenizer"`
	TrainingSampleCount int                   `json:"trainingSampleCount"`
	CreatedAt           time.Time             `json:"createdAt"`
	Evaluation          map[string]EvalReport `json:"evaluation,omitempty"`
	Training            TrainingMetadata      `json:"training"`
	Source              SourceMetadata        `json:"source,omitempty"`
}

ModelMeta stores model metadata and inference policy.

type PredictOptions ¶

type PredictOptions struct {
	TopK            int
	LanguageHint    string
	MinConfidence   float64 // if > 0, override model threshold for direct routing
	IgnoreThreshold bool
	// CandidateMode favors recall for LLM/tool-planner handoff. It keeps the
	// best intent accepted regardless of threshold while still returning TopK
	// candidates for downstream final selection.
	CandidateMode bool
}

PredictOptions controls prediction behavior.

type Prediction ¶

type Prediction struct {
	Intent      string      `json:"intent"`
	Language    string      `json:"language,omitempty"`
	Confidence  float64     `json:"confidence"`
	Strict      bool        `json:"strict"`
	Matched     bool        `json:"matched"`
	Reason      string      `json:"reason,omitempty"`
	Tokens      []string    `json:"tokens,omitempty"`
	Candidates  []Candidate `json:"candidates,omitempty"`
	ModelVer    string      `json:"modelVersion,omitempty"`
	UnknownUsed bool        `json:"unknownUsed"`
}

Prediction is one inference result. Candidates are always the raw ranked intent hypotheses before threshold rejection, so callers can pass them to an LLM/tool planner even when Intent is unknown.

type Router ¶

type Router struct {
	// contains filtered or unexported fields
}

Router routes inputs to language-specific intent engines.

func NewRouter ¶

func NewRouter(defaultLang string) *Router

NewRouter creates an empty router.

func NewRouterFromBundle ¶

func NewRouterFromBundle(bundleDir string) (*Router, error)

NewRouterFromBundle loads multilingual models from a bundle directory.

func NewRouterFromDirs ¶

func NewRouterFromDirs(modelByLanguage map[string]string, defaultLang string) (*Router, error)

NewRouterFromDirs loads language models from directories.

func NewRouterFromEmbedded ¶

func NewRouterFromEmbedded() (*Router, error)

NewRouterFromEmbedded loads router from embedded multilingual bundle assets.

func NewRouterFromEmbeddedIn ¶

func NewRouterFromEmbeddedIn(cacheDir string) (*Router, error)

NewRouterFromEmbeddedIn loads router from embedded assets and extracts files under cacheDir. If cacheDir is empty, user cache directory (or system temp) is used.

func NewRouterWithOptions ¶

func NewRouterWithOptions(defaultLang string, options RouterOptions) *Router

NewRouterWithOptions creates an empty router with options.

func (*Router) Languages ¶

func (r *Router) Languages() []string

Languages returns loaded languages sorted alphabetically.

func (*Router) Load ¶

func (r *Router) Load(language string, modelDir string) error

Load loads one language model directory into router.

func (*Router) Meta ¶ added in v0.3.0

func (r *Router) Meta() ModelMeta

Meta returns the default engine's model metadata, or empty if no engine is loaded.

func (*Router) Predict ¶

func (r *Router) Predict(ctx context.Context, text string, opts PredictOptions) (Prediction, error)

Predict routes by language hint/detection and returns prediction.

type RouterOptions ¶

type RouterOptions struct {
	AutoDetectMinConfidence     float64
	ShortTextRuneLimit          int
	EnableCrossLanguageFallback bool
}

RouterOptions controls auto language routing behavior.
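The exact semantics of these fields are not spelled out here, so the following is only an assumed reading of the field names, sketched with the standard library: an explicit hint wins, very short or low-confidence text falls back to the default language, and cross-language fallback is used when the chosen language has no loaded engine. The `routerOptions` mirror type, the toy `detect` function, and `pickLanguage` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// routerOptions mirrors the RouterOptions fields. The behavior implemented
// below is an assumed interpretation, not the package's actual logic.
type routerOptions struct {
	AutoDetectMinConfidence     float64
	ShortTextRuneLimit          int
	EnableCrossLanguageFallback bool
}

// detect is a toy detector: the share of Han runes among letters decides
// between "zh" and "en", and the majority share is the confidence.
func detect(text string) (string, float64) {
	var han, letters int
	for _, r := range text {
		if !unicode.IsLetter(r) {
			continue
		}
		letters++
		if unicode.Is(unicode.Han, r) {
			han++
		}
	}
	if letters == 0 {
		return "", 0
	}
	share := float64(han) / float64(letters)
	if share >= 0.5 {
		return "zh", share
	}
	return "en", 1 - share
}

// pickLanguage chooses which loaded language should handle the text.
func pickLanguage(text, hint, defaultLang string, loaded map[string]bool, o routerOptions) (string, error) {
	if hint != "" && loaded[hint] {
		return hint, nil // an explicit, loaded hint wins
	}
	lang, conf := detect(text)
	// Short or ambiguous text is not trusted to auto-detection.
	if utf8.RuneCountInString(text) <= o.ShortTextRuneLimit || conf < o.AutoDetectMinConfidence {
		lang = defaultLang
	}
	if loaded[lang] {
		return lang, nil
	}
	if o.EnableCrossLanguageFallback && loaded[defaultLang] {
		return defaultLang, nil
	}
	return "", errors.New("no engine loaded for language " + lang)
}

func main() {
	loaded := map[string]bool{"zh": true, "en": true}
	o := routerOptions{AutoDetectMinConfidence: 0.6, ShortTextRuneLimit: 2, EnableCrossLanguageFallback: true}
	lang, err := pickLanguage("明天天气怎么样", "", "en", loaded, o)
	fmt.Println(lang, err)
}
```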

func DefaultRouterOptions ¶

func DefaultRouterOptions() RouterOptions

DefaultRouterOptions returns default router behavior.

type Sample ¶

type Sample struct {
	Text   string
	Intent string
}

Sample is one supervised training sample.

func LoadSamplesCSV ¶

func LoadSamplesCSV(path string) ([]Sample, error)

LoadSamplesCSV loads labeled samples from CSV. Expected columns: text,intent (header row optional).

func LoadSamplesCSVWithWarnings ¶ added in v0.3.0

func LoadSamplesCSVWithWarnings(path string) ([]Sample, []string, error)

LoadSamplesCSVWithWarnings loads samples and reports conflicting duplicates (same text mapped to different intents).
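The conflict check described above can be illustrated with a small standard-library sketch. It is not the package's code: `parseSamples` and the `sample` type are hypothetical, header detection by the literal names `text,intent` is an assumption, and this sketch keeps conflicting rows while warning about them (the real function may treat them differently).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sample is a hypothetical mirror of the package's Sample type.
type sample struct {
	Text   string
	Intent string
}

// parseSamples reads "text,intent" rows, skips an optional header row, and
// reports every text that reappears with a different intent.
func parseSamples(data string) ([]sample, []string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = -1 // accept ragged rows here; validated below
	rows, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}

	var samples []sample
	var warnings []string
	firstIntent := map[string]string{} // text -> first intent seen
	for i, row := range rows {
		if len(row) < 2 {
			return nil, nil, fmt.Errorf("row %d: want 2 columns, got %d", i+1, len(row))
		}
		text, intent := strings.TrimSpace(row[0]), strings.TrimSpace(row[1])
		if i == 0 && strings.EqualFold(text, "text") && strings.EqualFold(intent, "intent") {
			continue // optional header row
		}
		if prev, seen := firstIntent[text]; !seen {
			firstIntent[text] = intent
		} else if prev != intent {
			warnings = append(warnings,
				fmt.Sprintf("row %d: %q labeled %q, previously %q", i+1, text, intent, prev))
		}
		samples = append(samples, sample{Text: text, Intent: intent})
	}
	return samples, warnings, nil
}

func main() {
	data := "text,intent\nhello,greeting\nhello,weather\nbye,greeting\n"
	samples, warnings, err := parseSamples(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(samples), "samples")
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```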

type SourceMetadata ¶

type SourceMetadata struct {
	Name     string            `json:"name,omitempty"`
	Version  string            `json:"version,omitempty"`
	Revision string            `json:"revision,omitempty"`
	RepoURL  string            `json:"repoUrl,omitempty"`
	Commit   string            `json:"commit,omitempty"`
	Extra    map[string]string `json:"extra,omitempty"`
}

SourceMetadata describes the training data source for reproducibility.

type TaxonomyConfig ¶

type TaxonomyConfig struct {
	Enabled bool              `json:"enabled"`
	Aliases map[string]string `json:"aliases,omitempty"`
}

TaxonomyConfig controls intent canonicalization.
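A minimal sketch of alias-based canonicalization, assuming a single case-insensitive lookup in `Aliases` when the taxonomy is enabled. The `taxonomyConfig` mirror type, the `canonicalIntent` helper, and the example labels are hypothetical; the package's own normalization may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// taxonomyConfig mirrors the TaxonomyConfig fields shown above.
type taxonomyConfig struct {
	Enabled bool
	Aliases map[string]string
}

// canonicalIntent maps an alias to its canonical intent. Trimming and
// lowercasing before the lookup are assumptions made for this sketch.
func canonicalIntent(intent string, cfg taxonomyConfig) string {
	if !cfg.Enabled {
		return intent // taxonomy disabled: leave the label untouched
	}
	key := strings.ToLower(strings.TrimSpace(intent))
	if canonical, ok := cfg.Aliases[key]; ok {
		return canonical
	}
	return key
}

func main() {
	cfg := taxonomyConfig{
		Enabled: true,
		Aliases: map[string]string{"weather_query": "weather"}, // hypothetical alias
	}
	fmt.Println(canonicalIntent(" Weather_Query ", cfg)) // prints: weather
	fmt.Println(canonicalIntent("calendar", cfg))        // prints: calendar
}
```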

func DefaultTaxonomyConfig ¶

func DefaultTaxonomyConfig() TaxonomyConfig

DefaultTaxonomyConfig returns default taxonomy config.

type Tokenizer ¶

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer wraps language-specific tokenization with normalization and filtering.

func NewTokenizer ¶

func NewTokenizer(cfg TokenizerConfig) (*Tokenizer, error)

NewTokenizer creates a tokenizer from config.

func (*Tokenizer) Config ¶

func (t *Tokenizer) Config() TokenizerConfig

Config returns a copy of tokenizer config.

func (*Tokenizer) Normalize ¶

func (t *Tokenizer) Normalize(text string) string

Normalize exposes text normalization for testing and dataset preprocessing.

func (*Tokenizer) Tokenize ¶

func (t *Tokenizer) Tokenize(text string) []string

Tokenize tokenizes normalized text and filters invalid tokens.

type TokenizerConfig ¶

type TokenizerConfig struct {
	Language      string   `json:"language,omitempty"`
	SearchMode    bool     `json:"searchMode"`
	HMM           bool     `json:"hmm"`
	Lowercase     bool     `json:"lowercase"`
	Stopwords     []string `json:"stopwords,omitempty"`
	CustomDicts   []string `json:"customDicts,omitempty"`
	MinTokenLen   int      `json:"minTokenLen"`
	StripPunct    bool     `json:"stripPunct"`
	CollapseSpace bool     `json:"collapseSpace"`
}

TokenizerConfig defines tokenizer behavior.
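To make the normalization-related fields concrete, here is a standard-library sketch of a whitespace tokenizer for non-CJK text. It is an illustration under assumptions, not the package's tokenizer: the `tokenizerConfig` subset, `normalize`, and `tokenize` are hypothetical, and the order of the steps is a guess. The `SearchMode`, `HMM`, and `CustomDicts` fields presumably apply to Chinese segmentation and are left out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizerConfig mirrors a subset of the TokenizerConfig fields.
type tokenizerConfig struct {
	Lowercase     bool
	Stopwords     []string
	MinTokenLen   int
	StripPunct    bool
	CollapseSpace bool
}

// normalize applies the enabled text-level steps in a fixed (assumed) order.
func normalize(text string, cfg tokenizerConfig) string {
	if cfg.Lowercase {
		text = strings.ToLower(text)
	}
	if cfg.StripPunct {
		// Replace punctuation and symbols with spaces so words stay separated.
		text = strings.Map(func(r rune) rune {
			if unicode.IsPunct(r) || unicode.IsSymbol(r) {
				return ' '
			}
			return r
		}, text)
	}
	if cfg.CollapseSpace {
		text = strings.Join(strings.Fields(text), " ")
	}
	return text
}

// tokenize splits normalized text on whitespace and drops stopwords and
// tokens shorter than MinTokenLen (measured in runes).
func tokenize(text string, cfg tokenizerConfig) []string {
	stop := make(map[string]bool, len(cfg.Stopwords))
	for _, w := range cfg.Stopwords {
		stop[w] = true
	}
	var tokens []string
	for _, tok := range strings.Fields(normalize(text, cfg)) {
		if len([]rune(tok)) < cfg.MinTokenLen || stop[tok] {
			continue
		}
		tokens = append(tokens, tok)
	}
	return tokens
}

func main() {
	cfg := tokenizerConfig{
		Lowercase: true, StripPunct: true, CollapseSpace: true,
		MinTokenLen: 2, Stopwords: []string{"the"},
	}
	fmt.Println(tokenize("  Open the   FILE, please!  ", cfg)) // prints: [open file please]
}
```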

func DefaultTokenizerConfig ¶

func DefaultTokenizerConfig() TokenizerConfig

DefaultTokenizerConfig returns a practical default tokenizer config.

type TrainConfig ¶

type TrainConfig struct {
	Version                 string             `json:"version"`
	UnknownIntent           string             `json:"unknownIntent"`
	DefaultThreshold        float64            `json:"defaultThreshold"`
	Thresholds              map[string]float64 `json:"thresholds,omitempty"`
	Tokenizer               TokenizerConfig    `json:"tokenizer"`
	Split                   DatasetSplitConfig `json:"split"`
	AutoCalibrateThresholds bool               `json:"autoCalibrateThresholds"`
	Taxonomy                TaxonomyConfig     `json:"taxonomy"`
	Source                  SourceMetadata     `json:"source,omitempty"`
}

TrainConfig controls training behavior.

func DefaultTrainConfig ¶

func DefaultTrainConfig() TrainConfig

DefaultTrainConfig returns a practical default training config.

type TrainedModel ¶

type TrainedModel struct {
	// contains filtered or unexported fields
}

TrainedModel is an in-memory trained artifact.

func Train ¶

func Train(samples []Sample, cfg TrainConfig) (*TrainedModel, error)

Train trains a bayesian model from labeled samples and produces metadata.

func (*TrainedModel) Meta ¶

func (m *TrainedModel) Meta() ModelMeta

Meta returns a copy of model metadata.

func (*TrainedModel) SaveDir ¶

func (m *TrainedModel) SaveDir(dir string) error

SaveDir persists the trained model into a directory.

type TrainingConfigSnapshot ¶

type TrainingConfigSnapshot struct {
	DefaultThreshold        float64            `json:"defaultThreshold"`
	Thresholds              map[string]float64 `json:"thresholds,omitempty"`
	Tokenizer               TokenizerConfig    `json:"tokenizer"`
	Split                   DatasetSplitConfig `json:"split"`
	AutoCalibrateThresholds bool               `json:"autoCalibrateThresholds"`
	TaxonomyEnabled         bool               `json:"taxonomyEnabled"`
}

TrainingConfigSnapshot stores the effective config used for training.

type TrainingMetadata ¶

type TrainingMetadata struct {
	Seed                 int64                        `json:"seed"`
	TotalSampleCount     int                          `json:"totalSampleCount"`
	TrainSampleCount     int                          `json:"trainSampleCount"`
	ValSampleCount       int                          `json:"valSampleCount"`
	TestSampleCount      int                          `json:"testSampleCount"`
	Calibrated           bool                         `json:"calibrated"`
	CalibratedThresholds map[string]float64           `json:"calibratedThresholds,omitempty"`
	DataSummary          map[string]IntentDataSummary `json:"dataSummary,omitempty"`
	Config               TrainingConfigSnapshot       `json:"config"`
}

TrainingMetadata stores reproducibility and data summary info.
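The role of `Seed` and the three sample counts can be illustrated with a seeded train/val/test split. This is a sketch under assumptions: `splitIndices` is hypothetical, the split is a plain seeded shuffle (the package may stratify by intent or use a different scheme), and percentages stand in for whatever `DatasetSplitConfig` actually holds.

```go
package main

import (
	"fmt"
	"math/rand"
)

// splitIndices shuffles 0..n-1 with a fixed seed and cuts the permutation
// into validation, test, and train slices. valPct + testPct must be <= 100.
func splitIndices(n int, seed int64, valPct, testPct int) (train, val, test []int) {
	idx := rand.New(rand.NewSource(seed)).Perm(n)
	nVal := n * valPct / 100
	nTest := n * testPct / 100
	val = idx[:nVal]
	test = idx[nVal : nVal+nTest]
	train = idx[nVal+nTest:]
	return train, val, test
}

func main() {
	train, val, test := splitIndices(10, 42, 20, 20)
	// The counts always add up to the total sample count.
	fmt.Println(len(train), len(val), len(test)) // prints: 6 2 2
}
```

Recording the seed alongside the counts is what lets a later run rebuild the same partition from the same dataset.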

Directories ¶

Path	Synopsis
cmd/intent-nlu-bundle	command
cmd/intent-nlu-feedback	command
cmd/intent-nlu-predict	command
cmd/intent-nlu-train	command
dataset/chatterbot

