llm

package module

v0.1.0 Latest Latest Go to latest Published: May 15, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/msradam/xk6-llm

Links

Open Source Insights

README ¶

xk6-llm

A k6 extension for benchmarking OpenAI-compatible chat-completions servers. Streaming-first. Emits TTFT, ITL, TPOT, goodput, and token-throughput metrics; per-chunk timing matches vllm bench serve semantics.

xk6-llm Grafana dashboard

1,470 requests, 0 errors, 100% goodput, $0.382 total cost. Three-turn conversations against ibm-granite/granite-4.1-30b on a single NVIDIA B300 SXM6. The "TTFT by turn" panel surfaces prefix-cache speedup across turns; the cost and energy panels are derived from server-reported token counts.

xk6-llm demo

Recorded against Ollama on an M3 MacBook Air. The same script and dashboard work against a hosted API or a vLLM cluster; only the numbers change.

Build

go install go.k6.io/xk6/cmd/xk6@latest
xk6 build --with github.com/msradam/xk6-llm@latest --output build/k6
./build/k6 version

Example

import llm from 'k6/x/llm';

const client = new llm.Client({
  base_url: 'http://localhost:11434/v1',
  model:    'granite4.1:3b',
});

export default async function () {
  const res = await client.chat({
    messages:   [{ role: 'user', content: 'Write a haiku about Poisson arrivals.' }],
    max_tokens: 128,
    temperature: 0,
  });
  console.log(`ttft=${res.ttft_ms.toFixed(1)}ms tpot=${res.tpot_ms.toFixed(1)}ms tokens=${res.completion_tokens}`);
}

./build/k6 run examples/chat.js

Metrics

Every metric is tagged model. Errors are additionally tagged error_type. Per-request cache_state and arbitrary tags are propagated when supplied.

Name	Type	Description
`llm_requests`	Counter	Successful chat completions.
`llm_errors`	Counter	Failures. Tag `error_type` in `{network, timeout, http_4xx, http_5xx, stream, decode}`.
`llm_request_duration`	Trend (Time)	End-to-end wall time.
`llm_response_headers`	Trend (Time)	Submit to HTTP response headers.
`llm_ttft`	Trend (Time)	Time to first token. First content-bearing SSE chunk; role-only deltas skipped.
`llm_itl`	Trend (Time)	Per-chunk inter-arrival vector. First sample is `t[chunk₂] - t[chunk₁]`.
`llm_tpot`	Trend (Time)	Scalar `(e2e - ttft) / (n - 1)`. Emitted only when `completion_tokens > 1`.
`llm_chunks_per_request`	Trend	Content chunks per request.
`llm_prompt_tokens`	Counter	Server-reported `usage.prompt_tokens`.
`llm_completion_tokens`	Counter	Server-reported `usage.completion_tokens`.
`llm_goodput`	Rate	All SLOs met. Emitted only when an `slo` predicate is supplied.
`llm_slo_ttft`	Rate	`ttft_ms <= slo.ttft_ms`.
`llm_slo_tpot`	Rate	`tpot_ms <= slo.tpot_ms`.
`llm_slo_e2el`	Rate	`duration_ms <= slo.e2el_ms`.
`llm_cost_usd`	Trend	USD per request. Emitted only when `cost` is supplied.
`llm_energy_j`	Trend	Estimated joules per request. Emitted only when `energy` is supplied.
`llm_energy_j_per_token`	Trend	`llm_energy_j / completion_tokens`.

API

`new llm.Client(opts)`

{
  base_url?:   string,                          // default: http://localhost:11434/v1
  api_key?:    string,
  model?:      string,
  timeout_ms?: number,                          // default: 60000
  ignore_eos?: boolean,
  headers?:    Record<string, string>,
  slo?:        { ttft_ms?, tpot_ms?, e2el_ms? },
  cost?:       { usd_per_million_input_tokens?, usd_per_million_output_tokens? },
  energy?:     { j_per_input_token?, j_per_output_token?, idle_w? },
}

`client.chat(req)`

req accepts the OpenAI chat-completion fields (messages, max_tokens, temperature, top_p, seed, etc.) plus optional slo, cache_state, and tags. Returns a Promise resolving to:

{
  content:             string,
  ttft_ms:             number,
  itl_ms:              number[],
  tpot_ms:             number,
  duration_ms:         number,
  response_headers_ms: number,
  chunks:              number,
  prompt_tokens:       number,
  completion_tokens:   number,
  finish_reason:       string,
}

`new llm.Dataset(opts)`

Replays a JSONL prompt corpus. One line per request: {"messages": [...], "max_tokens"?: N, ...}. Loaded once per process and cached by absolute path.

{
  path:     string,
  seed?:    number,    // default: 42
  shuffle?: boolean,
}

Methods: dataset.size(), dataset.next(), dataset.at(i), dataset.reset().

A converter for ShareGPT V3 to this format lives at scripts/sharegpt_to_jsonl.py.

What you can simulate

k6's load generator is a JavaScript runtime, so a single VU can carry conversation state, branch on responses, and drive multi-call workflows. xk6-llm doesn't define a session or agent abstraction; instead, the patterns live in examples/ and use the existing Client plus k6's tags to make the workflow visible in the dashboard.

Example	What it shows
`multi-turn.js`	A 5-turn conversation per VU iteration. Tags `cache_state` and `turn` on every call so the dashboard can show TTFT degradation across turns and prefix-cache speedup (typically 5 to 15x by turn 5).
`agent.js`	Tool-calling loop. Model emits `TOOL: name(arg)` or `DONE: answer`; the script runs the tool, feeds the result back, repeats. Tags `agent_id` and `iteration` so the dashboard rolls up per-session totals and full-envelope p95.
`rag.js`	Embed -> vector retrieve -> generate. Each phase has its own k6 Trend (`rag_embed_ms`, `rag_retrieve_ms`) with independent SLO thresholds. Set `EMBED_URL` and `RETRIEVE_URL` to point at real services.
`ab-providers.js`	Two `Client` instances under two scenarios with different `cost` configs. Identical traffic, side-by-side latency and dollar-cost panels. Procurement decision in one screenshot.

Quickstart with Grafana

A Docker Compose stack with a pre-provisioned dashboard is in quickstart/. See QUICKSTART.md.

Validation

Cross-validated against vllm bench serve on a real vLLM 0.21.0 server (Qwen2.5-72B-Instruct-AWQ, A100 80GB). TPOT/ITL/E2EL agree within 5% on identical workloads; token counts are bit-exact with vLLM /metrics. Numbers and methodology in docs/validation-parity.md and docs/validation-results.md.

Compatibility

xk6-llm	k6	xk6
v0.x	v2.0.0	1.4.1

Attribution

This codebase was developed with assistance from Claude Code. PRs are reviewed and merged by humans.

License

Apache-2.0

Documentation ¶

Overview ¶

Package llm registers `k6/x/llm`, a k6 extension for LLM-aware load testing. See the project README for the metric set and per-request semantics.

Index ¶

type Client
- func (c *Client) Chat(req map[string]any) *sobek.Promise
type CostModel
- func (c *CostModel) Empty() bool
- func (c *CostModel) USD(promptTokens, completionTokens int) float64
type Dataset
type EnergyModel
- func (e *EnergyModel) Empty() bool
- func (e *EnergyModel) Joules(promptTokens, completionTokens int, duration time.Duration) float64
type Options
type SLOPredicate
- func (s *SLOPredicate) Empty() bool

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Client ¶

type Client struct {
	// contains filtered or unexported fields
}

Client is the JS-facing OpenAI-compatible chat client.

func (*Client) Chat ¶

func (c *Client) Chat(req map[string]any) *sobek.Promise

Chat sends a streaming chat completion request. Returns a Promise resolving to a result object (see chatResult.toJSObject for the full shape) or rejecting with the categorized error.

JS:

client.chat({
  messages: [...],
  max_tokens: 256,
  // control fields (peeled off before the upstream POST):
  slo:         { ttft_ms: 500, tpot_ms: 50, e2el_ms: 5000 },
  cache_state: "cold",
  tags:        { region: "us-east", shape: "short" },
})

type CostModel ¶

type CostModel struct {
	USDPerMInputTokens  float64
	USDPerMOutputTokens float64
}

CostModel parameterises a per-request USD cost estimate, computed from server-reported token counts:

usd = prompt_tokens     * usd_per_million_input_tokens  / 1e6
    + completion_tokens * usd_per_million_output_tokens / 1e6

Use hosted-API published rates directly; for self-hosted inference, compute your effective $/M-token rate offline (idle GPU $/hr divided by sustained throughput, plus marginal electricity) and plug it in here.

func (*CostModel) Empty ¶

func (c *CostModel) Empty() bool

Empty reports whether the model would produce zero for any request.

func (*CostModel) USD ¶

func (c *CostModel) USD(promptTokens, completionTokens int) float64

USD returns the dollar cost for a request with the given token counts.

type Dataset ¶

type Dataset struct {
	// contains filtered or unexported fields
}

Dataset is a deterministic, replayable corpus of chat requests. Loaded once per process from a JSONL file and shared across VUs via dsCache; per-Dataset instances each carry their own cursor and shuffle permutation so two VUs reading from the same file do not see the same order unless they share a (path, seed) pair.

func (*Dataset) At ¶

func (d *Dataset) At(i int64) map[string]any

At returns the i-th request (modulo dataset size, negative-safe) without advancing the cursor. Use this when the caller wants to derive the index from __VU and __ITER for fully reproducible workloads.

func (*Dataset) Next ¶

func (d *Dataset) Next() map[string]any

Next advances the internal cursor and returns the next request, wrapping at the end. Concurrency-safe across VUs (within a single process) when the same Dataset instance is reused; in k6 each VU constructs its own instance, so "wrap" semantics apply per VU.

func (*Dataset) Reset ¶

func (d *Dataset) Reset()

Reset rewinds the internal cursor. Useful for repeated runs in tests.

func (*Dataset) Size ¶

func (d *Dataset) Size() int

Size returns the number of items in the dataset.

type EnergyModel ¶

type EnergyModel struct {
	JPerInputToken  float64
	JPerOutputToken float64
	IdleW           float64
}

EnergyModel parameterises a per-request energy estimate. The math, per request:

dynamic_j = prompt_tokens * j_per_input_token + completion_tokens * j_per_output_token
static_j  = idle_w * (duration_s)
total_j   = dynamic_j + static_j

Coefficients must be measured for your (GPU, model, batch regime) tuple. Under concurrent load the static term over-attributes idle power; divide idle_w by your expected per-VU concurrency for wall-plug accuracy. This is a budgeting metric, not a measurement.

func (*EnergyModel) Empty ¶

func (e *EnergyModel) Empty() bool

Empty reports whether the model would produce zero for any request.

func (*EnergyModel) Joules ¶

func (e *EnergyModel) Joules(promptTokens, completionTokens int, duration time.Duration) float64

Joules returns the total energy budget for a request with the given token counts and wall-clock duration. Zero when Empty.

type Options ¶

type Options struct {
	BaseURL   string
	APIKey    string
	Model     string
	Timeout   time.Duration
	IgnoreEOS bool
	// Headers are sent on every request. Use for custom auth schemes, gateway
	// routing keys (e.g. OpenRouter "HTTP-Referer"), or observability headers.
	Headers map[string]string
	// DefaultSLO applies to every chat() call that doesn't supply its own.
	DefaultSLO *SLOPredicate
	// Energy, when set, enables per-request energy estimation. See EnergyModel.
	Energy *EnergyModel
	// Cost, when set, enables per-request USD estimation. See CostModel.
	Cost *CostModel
}

Options configures an llm.Client.

type SLOPredicate ¶

type SLOPredicate struct {
	TTFTMs float64
	TPOTMs float64
	E2ELMs float64
}

SLOPredicate is the per-request SLO used to compute goodput and per-SLO attainment Rates. A zero field disables that SLO (it always passes).

Semantics match vLLM's `--goodput ttft:X tpot:Y e2el:Z` flag (PR #9338, shipped v0.6.4).

func (*SLOPredicate) Empty ¶

func (s *SLOPredicate) Empty() bool

Empty reports whether the predicate is functionally a no-op.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL