llm

package module
v0.1.0 Latest
Published: May 15, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

README

xk6-llm

A k6 extension for benchmarking OpenAI-compatible chat-completions servers. Streaming-first. Emits TTFT, ITL, TPOT, goodput, and token-throughput metrics; per-chunk timing matches vllm bench serve semantics.

xk6-llm Grafana dashboard

1,470 requests, 0 errors, 100% goodput, $0.382 total cost. Three-turn conversations against ibm-granite/granite-4.1-30b on a single NVIDIA B300 SXM6. The "TTFT by turn" panel surfaces prefix-cache speedup across turns; the cost and energy panels are derived from server-reported token counts.

xk6-llm demo

Recorded against Ollama on an M3 MacBook Air. The same script and dashboard work against a hosted API or a vLLM cluster; only the numbers change.

Build

go install go.k6.io/xk6/cmd/xk6@latest
xk6 build --with github.com/msradam/xk6-llm@latest --output build/k6
./build/k6 version

Example

import llm from 'k6/x/llm';

const client = new llm.Client({
  base_url: 'http://localhost:11434/v1',
  model:    'granite4.1:3b',
});

export default async function () {
  const res = await client.chat({
    messages:   [{ role: 'user', content: 'Write a haiku about Poisson arrivals.' }],
    max_tokens: 128,
    temperature: 0,
  });
  console.log(`ttft=${res.ttft_ms.toFixed(1)}ms tpot=${res.tpot_ms.toFixed(1)}ms tokens=${res.completion_tokens}`);
}

./build/k6 run examples/chat.js

Metrics

Every metric carries a model tag. Errors additionally carry an error_type tag. A per-request cache_state and any arbitrary tags are propagated to the metrics when supplied.

| Name | Type | Description |
| --- | --- | --- |
| llm_requests | Counter | Successful chat completions. |
| llm_errors | Counter | Failures. Tag error_type in {network, timeout, http_4xx, http_5xx, stream, decode}. |
| llm_request_duration | Trend (Time) | End-to-end wall time. |
| llm_response_headers | Trend (Time) | Submit to HTTP response headers. |
| llm_ttft | Trend (Time) | Time to first token. First content-bearing SSE chunk; role-only deltas skipped. |
| llm_itl | Trend (Time) | Per-chunk inter-arrival vector. First sample is t[chunk₂] - t[chunk₁]. |
| llm_tpot | Trend (Time) | Scalar (e2e - ttft) / (n - 1). Emitted only when completion_tokens > 1. |
| llm_chunks_per_request | Trend | Content chunks per request. |
| llm_prompt_tokens | Counter | Server-reported usage.prompt_tokens. |
| llm_completion_tokens | Counter | Server-reported usage.completion_tokens. |
| llm_goodput | Rate | All SLOs met. Emitted only when an slo predicate is supplied. |
| llm_slo_ttft | Rate | ttft_ms <= slo.ttft_ms. |
| llm_slo_tpot | Rate | tpot_ms <= slo.tpot_ms. |
| llm_slo_e2el | Rate | duration_ms <= slo.e2el_ms. |
| llm_cost_usd | Trend | USD per request. Emitted only when cost is supplied. |
| llm_energy_j | Trend | Estimated joules per request. Emitted only when energy is supplied. |
| llm_energy_j_per_token | Trend | llm_energy_j / completion_tokens. |

API

new llm.Client(opts)

{
  base_url?:   string,                          // default: http://localhost:11434/v1
  api_key?:    string,
  model?:      string,
  timeout_ms?: number,                          // default: 60000
  ignore_eos?: boolean,
  headers?:    Record<string, string>,
  slo?:        { ttft_ms?, tpot_ms?, e2el_ms? },
  cost?:       { usd_per_million_input_tokens?, usd_per_million_output_tokens? },
  energy?:     { j_per_input_token?, j_per_output_token?, idle_w? },
}

client.chat(req)

req accepts the OpenAI chat-completion fields (messages, max_tokens, temperature, top_p, seed, etc.) plus optional slo, cache_state, and tags. Returns a Promise resolving to:

{
  content:             string,
  ttft_ms:             number,
  itl_ms:              number[],
  tpot_ms:             number,
  duration_ms:         number,
  response_headers_ms: number,
  chunks:              number,
  prompt_tokens:       number,
  completion_tokens:   number,
  finish_reason:       string,
}

new llm.Dataset(opts)

Replays a JSONL prompt corpus. One line per request: {"messages": [...], "max_tokens"?: N, ...}. Loaded once per process and cached by absolute path.

{
  path:     string,
  seed?:    number,    // default: 42
  shuffle?: boolean,
}

Methods: dataset.size(), dataset.next(), dataset.at(i), dataset.reset().

A converter for ShareGPT V3 to this format lives at scripts/sharegpt_to_jsonl.py.

What you can simulate

k6's load generator is a JavaScript runtime, so a single VU can carry conversation state, branch on responses, and drive multi-call workflows. xk6-llm doesn't define a session or agent abstraction; instead, the patterns live in examples/ and use the existing Client plus k6's tags to make the workflow visible in the dashboard.

| Example | What it shows |
| --- | --- |
| multi-turn.js | A 5-turn conversation per VU iteration. Tags cache_state and turn on every call so the dashboard can show TTFT degradation across turns and prefix-cache speedup (typically 5 to 15x by turn 5). |
| agent.js | Tool-calling loop. Model emits TOOL: name(arg) or DONE: answer; the script runs the tool, feeds the result back, repeats. Tags agent_id and iteration so the dashboard rolls up per-session totals and full-envelope p95. |
| rag.js | Embed -> vector retrieve -> generate. Each phase has its own k6 Trend (rag_embed_ms, rag_retrieve_ms) with independent SLO thresholds. Set EMBED_URL and RETRIEVE_URL to point at real services. |
| ab-providers.js | Two Client instances under two scenarios with different cost configs. Identical traffic, side-by-side latency and dollar-cost panels. Procurement decision in one screenshot. |

Quickstart with Grafana

A Docker Compose stack with a pre-provisioned dashboard is in quickstart/. See QUICKSTART.md.

Validation

Cross-validated against vllm bench serve on a real vLLM 0.21.0 server (Qwen2.5-72B-Instruct-AWQ, A100 80GB). TPOT/ITL/E2EL agree within 5% on identical workloads; token counts are bit-exact with vLLM /metrics. Numbers and methodology in docs/validation-parity.md and docs/validation-results.md.

Compatibility

| xk6-llm | k6 | xk6 |
| --- | --- | --- |
| v0.x | v2.0.0 | 1.4.1 |

Attribution

This codebase was developed with assistance from Claude Code. PRs are reviewed and merged by humans.

License

Apache-2.0

Documentation

Overview

Package llm registers `k6/x/llm`, a k6 extension for LLM-aware load testing. See the project README for the metric set and per-request semantics.

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the JS-facing OpenAI-compatible chat client.

func (*Client) Chat

func (c *Client) Chat(req map[string]any) *sobek.Promise

Chat sends a streaming chat completion request. Returns a Promise resolving to a result object (see chatResult.toJSObject for the full shape) or rejecting with the categorized error.

JS:

client.chat({
  messages: [...],
  max_tokens: 256,
  // control fields (peeled off before the upstream POST):
  slo:         { ttft_ms: 500, tpot_ms: 50, e2el_ms: 5000 },
  cache_state: "cold",
  tags:        { region: "us-east", shape: "short" },
})

type CostModel

type CostModel struct {
	USDPerMInputTokens  float64
	USDPerMOutputTokens float64
}

CostModel parameterises a per-request USD cost estimate, computed from server-reported token counts:

usd = prompt_tokens     * usd_per_million_input_tokens  / 1e6
    + completion_tokens * usd_per_million_output_tokens / 1e6

Use hosted-API published rates directly; for self-hosted inference, compute your effective $/M-token rate offline (idle GPU $/hr divided by sustained throughput, plus marginal electricity) and plug it in here.
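
As a worked instance of the formula, the sketch below uses a local stand-in type rather than the exported CostModel, and the rates are made-up illustration values. The effectiveUSDPerMTokens helper is hypothetical: it covers only the "GPU cost divided by sustained throughput" part of the self-hosted calculation and leaves the marginal electricity term out.

```go
package main

import "fmt"

// costModel is a local stand-in mirroring the two documented rate fields.
type costModel struct {
	usdPerMInputTokens  float64
	usdPerMOutputTokens float64
}

// usd applies the documented formula to server-reported token counts.
func (c costModel) usd(promptTokens, completionTokens int) float64 {
	return float64(promptTokens)*c.usdPerMInputTokens/1e6 +
		float64(completionTokens)*c.usdPerMOutputTokens/1e6
}

// effectiveUSDPerMTokens converts an hourly GPU price and a sustained
// throughput (tokens per second) into a USD-per-million-tokens rate.
func effectiveUSDPerMTokens(gpuUSDPerHour, tokensPerSecond float64) float64 {
	tokensPerHour := tokensPerSecond * 3600
	return gpuUSDPerHour / tokensPerHour * 1e6
}

func main() {
	// Illustrative rates only: $0.50 per million input, $1.50 per million output.
	c := costModel{usdPerMInputTokens: 0.50, usdPerMOutputTokens: 1.50}
	fmt.Printf("%.6f\n", c.usd(1200, 300)) // prints: 0.001050

	// A $3.60/hr GPU sustaining 1000 tokens/s works out to $1.00 per million tokens.
	fmt.Printf("%.2f\n", effectiveUSDPerMTokens(3.60, 1000)) // prints: 1.00
}
```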

func (*CostModel) Empty

func (c *CostModel) Empty() bool

Empty reports whether the model would produce zero for every request, that is, whether no rate is configured.

func (*CostModel) USD

func (c *CostModel) USD(promptTokens, completionTokens int) float64

USD returns the dollar cost for a request with the given token counts.

type Dataset

type Dataset struct {
	// contains filtered or unexported fields
}

Dataset is a deterministic, replayable corpus of chat requests. The corpus is loaded once per process from a JSONL file and shared across VUs via dsCache. Each Dataset instance carries its own cursor and shuffle permutation, so two VUs reading from the same file do not see the same order unless they share a (path, seed) pair.
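
Dataset's fields are unexported, so the following is only a sketch of the behaviour described above: a seeded permutation over shared items plus a per-instance cursor that wraps. The cursor type and newCursor function are hypothetical, math/rand is an assumed choice of generator, and the locking that a concurrency-safe version would need is omitted.

```go
package main

import (
	"fmt"
	"math/rand"
)

// cursor walks a fixed ordering of item indices and wraps at the end.
type cursor struct {
	order []int
	pos   int
}

// newCursor builds the ordering for n items (n > 0). With shuffle enabled
// the order is a permutation determined entirely by seed.
func newCursor(n int, seed int64, shuffle bool) *cursor {
	order := make([]int, n)
	for i := range order {
		order[i] = i
	}
	if shuffle {
		r := rand.New(rand.NewSource(seed))
		r.Shuffle(n, func(i, j int) { order[i], order[j] = order[j], order[i] })
	}
	return &cursor{order: order}
}

// next returns the current item index and advances, wrapping after the last.
func (c *cursor) next() int {
	i := c.order[c.pos]
	c.pos = (c.pos + 1) % len(c.order)
	return i
}

func main() {
	a := newCursor(5, 42, true)
	b := newCursor(5, 42, true)
	same := true
	for i := 0; i < 10; i++ { // two full passes, to exercise the wrap
		if a.next() != b.next() {
			same = false
		}
	}
	fmt.Println(same) // prints: true
}
```

Two cursors built with the same seed replay the same order, which is the property that makes a run reproducible.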

func (*Dataset) At

func (d *Dataset) At(i int64) map[string]any

At returns the i-th request (modulo dataset size, negative-safe) without advancing the cursor. Use this when the caller wants to derive the index from __VU and __ITER for fully reproducible workloads.
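
A "modulo dataset size, negative-safe" lookup is commonly written as a double modulo, because Go's % operator keeps the sign of the dividend. The sketch below shows that idiom together with one possible way to derive an index from the VU and iteration numbers. Both wrap and indexFor are hypothetical helpers, and the striding scheme is an assumption, not something the package prescribes.

```go
package main

import "fmt"

// wrap maps any index into [0, n). n must be positive.
func wrap(i, n int64) int64 {
	return ((i % n) + n) % n
}

// indexFor gives each VU its own contiguous block of itersPerVU items.
// In k6, __VU starts at 1 and __ITER starts at 0 inside the default function.
func indexFor(vu, iter, itersPerVU int64) int64 {
	return (vu-1)*itersPerVU + iter
}

func main() {
	fmt.Println(wrap(7, 5))  // prints: 2
	fmt.Println(wrap(-1, 5)) // prints: 4

	// VU 3, iteration 4, with 10 iterations planned per VU, on a 25-item corpus.
	fmt.Println(wrap(indexFor(3, 4, 10), 25)) // prints: 24
}
```

Because the index depends only on the VU number and iteration count, rerunning the same scenario requests the same items in the same slots.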

func (*Dataset) Next

func (d *Dataset) Next() map[string]any

Next advances the internal cursor and returns the next request, wrapping at the end. Concurrency-safe across VUs (within a single process) when the same Dataset instance is reused; in k6 each VU constructs its own instance, so "wrap" semantics apply per VU.

func (*Dataset) Reset

func (d *Dataset) Reset()

Reset rewinds the internal cursor. Useful for repeated runs in tests.

func (*Dataset) Size

func (d *Dataset) Size() int

Size returns the number of items in the dataset.

type EnergyModel

type EnergyModel struct {
	JPerInputToken  float64
	JPerOutputToken float64
	IdleW           float64
}

EnergyModel parameterises a per-request energy estimate. The math, per request:

dynamic_j = prompt_tokens * j_per_input_token + completion_tokens * j_per_output_token
static_j  = idle_w * (duration_s)
total_j   = dynamic_j + static_j

Coefficients must be measured for your (GPU, model, batch regime) tuple. Under concurrent load the static term over-attributes idle power; divide idle_w by your expected per-VU concurrency for wall-plug accuracy. This is a budgeting metric, not a measurement.

func (*EnergyModel) Empty

func (e *EnergyModel) Empty() bool

Empty reports whether the model would produce zero for every request, that is, whether no coefficient is configured.

func (*EnergyModel) Joules

func (e *EnergyModel) Joules(promptTokens, completionTokens int, duration time.Duration) float64

Joules returns the total energy budget for a request with the given token counts and wall-clock duration. Zero when Empty.

type Options

type Options struct {
	BaseURL   string
	APIKey    string
	Model     string
	Timeout   time.Duration
	IgnoreEOS bool
	// Headers are sent on every request. Use for custom auth schemes, gateway
	// routing keys (e.g. OpenRouter "HTTP-Referer"), or observability headers.
	Headers map[string]string
	// DefaultSLO applies to every chat() call that doesn't supply its own.
	DefaultSLO *SLOPredicate
	// Energy, when set, enables per-request energy estimation. See EnergyModel.
	Energy *EnergyModel
	// Cost, when set, enables per-request USD estimation. See CostModel.
	Cost *CostModel
}

Options configures an llm.Client.

type SLOPredicate

type SLOPredicate struct {
	TTFTMs float64
	TPOTMs float64
	E2ELMs float64
}

SLOPredicate is the per-request SLO used to compute goodput and per-SLO attainment Rates. A zero field disables that SLO (it always passes).

Semantics match vLLM's `--goodput ttft:X tpot:Y e2el:Z` flag (PR #9338, shipped v0.6.4).
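
The "zero disables" rule and the all-SLOs-met definition of goodput can be expressed compactly. The sketch below uses local stand-in types and hypothetical method names; the thresholds are the ones from the client.chat example (ttft 500 ms, tpot 50 ms, e2el 5000 ms), and the observed values are invented.

```go
package main

import "fmt"

// sloPredicate is a local stand-in mirroring the three documented fields.
type sloPredicate struct {
	ttftMs, tpotMs, e2elMs float64
}

// observed carries the measured values for one request, in milliseconds.
type observed struct {
	ttftMs, tpotMs, durationMs float64
}

// pass reports whether value meets limit; a zero limit always passes.
func pass(value, limit float64) bool {
	return limit == 0 || value <= limit
}

// empty reports whether no SLO is configured at all.
func (s sloPredicate) empty() bool {
	return s.ttftMs == 0 && s.tpotMs == 0 && s.e2elMs == 0
}

// good reports whether the request meets every configured SLO.
func (s sloPredicate) good(o observed) bool {
	return pass(o.ttftMs, s.ttftMs) &&
		pass(o.tpotMs, s.tpotMs) &&
		pass(o.durationMs, s.e2elMs)
}

func main() {
	slo := sloPredicate{ttftMs: 500, tpotMs: 50, e2elMs: 5000}
	req := observed{ttftMs: 420, tpotMs: 61, durationMs: 3900}

	fmt.Println(slo.good(req)) // prints: false (tpot 61 > 50)

	slo.tpotMs = 0             // disable the TPOT SLO
	fmt.Println(slo.good(req)) // prints: true
}
```

A request counts toward goodput only if it passes all enabled checks, so one slow dimension is enough to fail it even when the others are comfortably within budget.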

func (*SLOPredicate) Empty

func (s *SLOPredicate) Empty() bool

Empty reports whether the predicate is functionally a no-op.
