# Actor A/B Tester — Compare Two Actors Side by Side (`ryanclinton/actor-ab-tester`) Actor

Run two Apify actors with identical input in parallel and compare results side by side. Measures result count, field coverage, execution speed, and compute cost. Declares a winner with percentage diffs. Returns JSON/CSV/Excel.

- **URL**: https://apify.com/ryanclinton/actor-ab-tester.md
- **Developed by:** [Ryan Clinton](https://apify.com/ryanclinton) (community)
- **Categories:** Developer tools, Automation
- **Stats:** 2 total users, 1 monthly users, 96.7% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$500.00 / 1,000 a/b tests

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Actor A/B Tester — Compare Two Apify Actors and Get a Production Decision

**Single-run comparisons are unreliable. Use multiple runs to reduce variance.**
A single run per actor can produce misleading results.

Actor A/B Tester runs two candidate Apify actors on the same input multiple times, compares the results, and returns a production decision on which one to ship.

Use this actor when you need to decide which actor to use in production.

### Contract

Actor A/B Tester runs two Apify actors on the same input and returns a production decision.

Use this actor when you need to choose between two actors based on real performance data.

This replaces manual comparison of run outputs.

The output field `decisionPosture` is the routable control signal for automation. It determines what to do next:

- `switch_now` — commit to the winner
- `canary_recommended` — partial rollout
- `monitor_only` — directional result, do not switch
- `no_call` — insufficient or unreliable evidence

Always branch on `decisionPosture`. It is the only field you should use for control flow.
Do not branch on `verdictHuman`, `oneLine`, or `decisionReason`.

Do not use single-run results to choose between actors.

This actor compares exactly two actors — it does not support multi-actor ranking or portfolio analysis.

### Quick start

#### Input

```json
{
    "actorA": "user/actorA",
    "actorB": "user/actorB",
    "testInput": { "query": "..." },
    "mode": "decision"
}
````

#### Output (minimal)

```json
{
    "decisionPosture": "switch_now",
    "confidence": 0.82,
    "decisionReadiness": "actionable"
}
```

#### Usage

```python
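# `result` is the comparison record (or the compact SUMMARY) from a finished run.
# The four handlers are placeholders for your own deployment hooks.
posture = result["decisionPosture"]
if posture == "switch_now":
    switch_to_winner()
elif posture == "canary_recommended":
    rollout_canary()
elif posture == "monitor_only":
    log_and_retry()
else:  # "no_call", or any value this code does not recognize
    keep_current()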
```
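The `result` object above has to come from a finished run. The following is a minimal sketch, assuming the official `apify-client` Python package and assuming the comparison record is the first item in the run's default dataset. `run_ab_test` is a hypothetical helper name, and `client` is expected to be an `ApifyClient` instance that you construct yourself.

```python
def run_ab_test(client, actor_a, actor_b, test_input, mode="decision"):
    """Run Actor A/B Tester and return the comparison record, or None.

    `client` is assumed to be an apify_client.ApifyClient instance. It is
    passed in so the helper can also be exercised with a stub.
    """
    run_input = {
        "actorA": actor_a,
        "actorB": actor_b,
        "testInput": test_input,
        "mode": mode,
    }
    # .call() starts the run and waits for it to finish.
    run = client.actor("ryanclinton/actor-ab-tester").call(run_input=run_input)
    if run is None:
        return None
    # Assumption: the comparison record is the first item in the default dataset.
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    return items[0] if items else None
```

With a real token this could be called as `run_ab_test(ApifyClient("<YOUR_API_TOKEN>"), "user/actorA", "user/actorB", {"query": "..."})` after `from apify_client import ApifyClient`.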

### Execution pattern (canonical)

1. Run Actor A and Actor B on the same input
2. Compare results across N runs
3. Branch on `decisionPosture`

Never:

- compare actors with single runs
- branch on `verdictHuman`, `oneLine`, or `decisionReason`
- ignore `blocking` warnings

### Mental model

run A + run B → compare results → return decision → act

### Decision invariants

These always hold — the actor enforces them in code. You can rely on them in automation without defensive checks.

- `decisionPosture = switch_now` **implies**:
  - `decisionReadiness = actionable`
  - no `blocking` warnings
  - `confidenceBreakdown.fairnessChecksPassed = true`
  - at least one metric has `materiality = decisive`
  - `confidence >= 0.7`
  - `decisionStability.flipRisk != high`
  - `runsPerActor >= 2`

- `verdictCode = NO_CALL` **implies**:
  - `decisionPosture = no_call`
  - `decisionReadiness = insufficient-data`
  - `comparison.winner = no_call`

- Any `blocking` warning **implies**:
  - `decisionPosture != switch_now`
  - `decisionReadiness != actionable`

- `fairnessChecksPassed = false` **implies**:
  - `decisionReadiness != actionable`
  - `confidence` is halved (harmonic-mean output × 0.5)

- `runsPerActor = 1` **implies**:
  - `decisionReadiness != actionable` (smoke tests are capped at `monitor`)
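The invariants above are enforced by the actor, but a consumer may still want to assert them in its own test suite, for example against recorded fixtures. The sketch below restates them as a check over a comparison record; `invariant_violations` is a hypothetical helper, and the field paths follow the output contract documented in this README.

```python
def invariant_violations(record):
    """Return a list of violated invariants (an empty list means all hold)."""
    comparison = record.get("comparison", {})
    breakdown = comparison.get("confidenceBreakdown", {})
    blocking = [w.get("code") for w in comparison.get("warnings", [])
                if w.get("severity") == "blocking"]
    posture = record.get("decisionPosture")
    readiness = record.get("decisionReadiness")
    fairness = breakdown.get("fairnessChecksPassed")
    problems = []

    if posture == "switch_now":
        if readiness != "actionable":
            problems.append("switch_now without actionable readiness")
        if fairness is not True:
            problems.append("switch_now without passing fairness checks")
        if "decisive" not in comparison.get("materiality", {}).values():
            problems.append("switch_now without a decisive metric")
        if comparison.get("confidence", 0) < 0.7:
            problems.append("switch_now with confidence below 0.7")
        if comparison.get("decisionStability", {}).get("flipRisk") == "high":
            problems.append("switch_now with high flipRisk")
        if record.get("runsPerActor", 0) < 2:
            problems.append("switch_now with fewer than 2 runs per actor")

    if blocking and (posture == "switch_now" or readiness == "actionable"):
        problems.append("blocking warning present: " + ", ".join(map(str, blocking)))

    if record.get("verdictCode") == "NO_CALL":
        if (posture != "no_call" or readiness != "insufficient-data"
                or comparison.get("winner") != "no_call"):
            problems.append("NO_CALL verdict with inconsistent posture/readiness/winner")

    if fairness is False and readiness == "actionable":
        problems.append("actionable readiness despite failed fairness checks")
    if record.get("runsPerActor") == 1 and readiness == "actionable":
        problems.append("actionable readiness on a single-run smoke test")
    return problems
```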

### Input → Output

**Input:**

- Two Apify actors (`actorA`, `actorB`)
- One shared `testInput` JSON
- `mode` (`smoke` / `standard` / `decision` / `high_stakes`, i.e. 1, 3, 5, or 10 runs per actor) or an explicit `runs` count (1–10)
- Optional `decisionProfile`: `balanced` / `speed_first` / `cost_first` / `output_first` / `reliability_first`

**Output:**

- `decisionPosture` — `switch_now` / `canary_recommended` / `monitor_only` / `no_call` (the one field your automation should read)
- `verdictHuman` — one-sentence recommendation, paste-ready
- `confidence` + breakdown (harmonic mean of reliability, score separation, variance, and sample adequacy)
- `decisionStability` — how fragile the winner is across pairwise matchups
- `warnings[]` — `blocking` vs `advisory`, every code documented
- `sinceLastComparableRun` — delta vs last scheduled run of the same pair (opt-in)
- Full per-run stats, sample records, and Store popularity context

### Simple example

You have two scrapers pointed at the same site:

- **Actor A** — slower but cheaper
- **Actor B** — faster but costs more

You run A/B Tester with `mode: "decision"` (5 runs each). It produces:

> *"Switch production to Actor B. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."*

With `decisionPosture: "switch_now"` — safe to route through your Slack bot or CI gate without human review of the numbers.

### Decision contract

These are the promises this actor makes. Every one is enforced in the output contract.

- **Compares exactly two actors only.** No portfolios, no tournaments, no store-wide scans.
- **Same input and same runtime settings on both sides.** Same `testInput`, same timeout, same memory. Reported in `comparisonContext.fairnessChecks`.
- **Parallel launch.** Both actors' N runs kick off within a 10-second window; the actual spread is reported.
- **If fairness fails, `actionable` is forbidden.** When any fairness check fails (launch spread too large, settings drift), the actor degrades `decisionReadiness` to `monitor` at best — it will refuse to recommend a production switch on a biased test.
- **Observed cost only.** We report `usageTotalUsd` for the runs we orchestrated. Nothing about your account spend.
- **Store popularity is informational.** Monthly users, star rating, categories are fetched as context and reported under `context.storeSignals` — they do **not** influence the winner score under any profile.
- **Abstention is a first-class outcome.** `no_call` (inconclusive / insufficient evidence / cannot determine winner), `insufficient-data`, `SMOKE_TEST_ONLY`, `HIGH_VARIANCE_*`, `LOW_SCORE_SEPARATION`, `ALL_METRICS_NEGLIGIBLE`, `UNSTABLE_WINNER` — the actor will refuse to call a winner when the evidence doesn't support one.
- **Any blocking warning also forbids `actionable`.** Warnings are tiered `blocking` vs `advisory`. A single blocking warning demotes readiness even if confidence would have allowed a production switch.
- **One-shot comparator.** Not a long-term baseline monitor. Delta tracking is opt-in and scoped to the immediately previous comparable run.

### Example — a production decision in one run

A scraping team has two academic-paper scrapers wired up: `crossref-paper-search` and `europe-pmc-search`. Both accept `{query: "..."}`. They run a `decision` mode test (5 runs each, balanced profile):

```
headline:            "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each"
decisionPosture:     "switch_now"
decisionReadiness:   "actionable"
verdictCode:         "ACTOR_B_WIN"
verdictHuman:        "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."
confidence:          0.82  (high)
decisionStability:   winnerConsistency 0.96, flipRisk "low"
blockingWarnings:    []
```

`decisionPosture: "switch_now"` means every invariant held — fairness passed, no blocking warnings, at least one decisive metric, high confidence, pairwise-stable winner. The Slack notifier reads `SUMMARY.decisionPosture` and fires a "Ready to switch" alert. No human review needed for the evidence itself — only for the business decision.

### Decision flow

```
 N runs per side, launched in parallel
            ↓
 Aggregate medians / p90 / stddev
            ↓
 Fairness checks pass? ──── NO ──→ decisionReadiness = monitor (at best)
            ↓ YES
 Score gap ≥ 15%?      ──── NO ──→ no_call
            ↓ YES
 Any metric ≥ material? ─── NO ──→ no_call (ALL_METRICS_NEGLIGIBLE)
            ↓ YES
 Pairwise winner stable?  ── NO ──→ demote strong → moderate → weak
            ↓ YES
 Any blocking warning?   ── YES ─→ decisionReadiness = monitor (at best)
            ↓ NO
 Confidence ≥ 0.7 + decisive materiality? ── YES → strong / actionable
                                             NO  → moderate or weak / monitor
            ↓
 decisionPosture: switch_now | canary_recommended | monitor_only | no_call
```

### When to trust the verdict

Downstream automation can filter on two fields: `decisionPosture` (the action to take) or `decisionReadiness` (how far the evidence can be trusted). Posture is the preferred filter because it maps directly to "what do I do with this?".

| Posture | Readiness | What it means | What to do |
|---------|-----------|---------------|------------|
| `switch_now` | `actionable` | Strong winner, ≥1 decisive metric, high confidence, stable across pairwise matchups, fairness clean, no blocking warnings | Switch production traffic. Safe to act on in CI gates and Zapier flows. |
| `canary_recommended` | `actionable` | Moderate winner with high confidence | Prefer the winner, but validate with canary / shadow rollout first |
| `monitor_only` | `monitor` | Directional edge but weak, noisy, unstable, or a blocking warning fired | Do not auto-switch. Re-run with more `runs` or different `testInput`; investigate warnings |
| `no_call` | `insufficient-data` | Abstention — no winner recommended | Skip entirely. This is a valid, honest outcome. |

Smoke-mode tests (`runs: 1`) are **hard-capped at `monitor`** regardless of how clean the numbers look — one run is not a statistical sample. Fairness failures and blocking warnings are also hard caps.

### When NOT to trust the verdict

Every warning carries a `severity` — `blocking` or `advisory`. **Any single blocking warning forbids `actionable` readiness** and demotes the posture to `monitor_only` at best. Read `comparison.warnings[]` before acting.

#### Blocking warnings (forbid `actionable`)

| Code | Meaning |
|------|---------|
| `BOTH_FAILED` | Both actors failed every run. Test is invalid — check `testInput` compatibility and token permissions. |
| `SMOKE_TEST_ONLY` | `runs: 1`. Smoke mode is always capped at monitor. |
| `LOW_SCORE_SEPARATION` | Score gap <15% — actor abstained to `no_call`. |
| `ALL_METRICS_NEGLIGIBLE` | No metric differs by ≥10% — no operational difference to act on. |
| `RESULT_SHAPE_DIVERGENCE` | Field overlap <20%. The two actors may be solving different problems. Inspect `sampleRecord` manually. |
| `NO_DATA_EXTRACTED` | Both actors ran but returned no extractable fields. Your `testInput` likely doesn't match either schema. |
| `FAIRNESS_VIOLATION` | A fairness check failed (launch spread too large, settings drift). Test is biased. |
| `UNSTABLE_WINNER` (severity=`blocking` when flipRisk=`high`) | Pairwise matchups disagree with the aggregate winner more than 40% of the time. |
| `IDENTICAL_ACTORS` | `actorA` and `actorB` normalize to the same actor id. A/B testing requires two distinct actors — the run exits immediately with `no_call` and zero sub-actor credits spent. |

#### Advisory warnings (flag noise but don't block)

| Code | Meaning |
|------|---------|
| `ONE_SIDE_FAILED` | One actor succeeded zero times. Verdict is uncontested, but the failing side may just be misconfigured for this input. |
| `HIGH_VARIANCE_A` / `HIGH_VARIANCE_B` | Duration CV >50%. Increase `runs` or accept the noise floor. |
| `ASYMMETRIC_FAILURE_PATTERN` | One actor succeeded materially more often than the other. Test environment may be biased (token scope, rate limits, network). |
| `COST_PER_RESULT_UNSTABLE` | Cost CV >50%. Don't act on a cost edge alone. |
| `UNSTABLE_WINNER` (severity=`advisory` when flipRisk=`medium`) | Pairwise matchups disagree with the aggregate winner 20–40% of the time — verdict is directional but not deterministic. |
| `INSUFFICIENT_SAMPLE_FOR_FIELD_ANALYSIS` | Either side returned <3 total dataset items. Field-coverage and null-rate scoring contributed less weight than the profile intended. |

### Confidence components — what "good" looks like

`comparison.confidenceBreakdown` is a diagnostic panel. Each component is 0–1. The final `comparison.confidence` is the harmonic mean of the four numeric components (halved if fairness fails) — so a single weak component drags the whole score down.

| Component | Good (≥) | Risky (<) | Meaning |
|-----------|:--------:|:---------:|---------|
| `successReliability` | 0.9 | 0.7 | Fraction of runs that succeeded. Below 0.7 means too many runs are failing to trust the aggregate. |
| `scoreSeparation` | 0.3 | 0.15 | Score gap as a fraction of total score. Below 0.15 triggers abstention. |
| `variancePenalty` | 0.8 | 0.5 | Healthiness of variance (`1 - avgCV`). Below 0.5 means the runs were too noisy to trust. |
| `sampleAdequacy` | 0.5 | 0.3 | Linear ramp on run count — 1 run = 0.1, 3 = 0.3, 5 = 0.5, 10 = 1.0. |
| `fairnessChecksPassed` | `true` | `false` | Hard gate. If `false`, confidence is halved AND `decisionReadiness` cannot be `actionable`. |
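To see why a single weak component drags the score down, the harmonic mean can be worked through by hand. The sketch below is an illustration built only from the description above (four numeric components, linear ramp for `sampleAdequacy`, halving when fairness fails); the function names are hypothetical, and the actor's internal rounding or clamping may differ.

```python
def sample_adequacy(runs):
    """Linear ramp on run count: 1 run -> 0.1, 5 runs -> 0.5, 10 runs -> 1.0."""
    return min(max(runs, 0), 10) / 10


def confidence(success_reliability, score_separation, variance_penalty,
               runs, fairness_passed=True):
    """Harmonic mean of the four numeric components, halved if fairness fails."""
    components = [
        success_reliability,
        score_separation,
        variance_penalty,
        sample_adequacy(runs),
    ]
    # A zero component makes the harmonic mean zero (and would divide by zero).
    if any(c <= 0 for c in components):
        return 0.0
    score = len(components) / sum(1 / c for c in components)
    return score if fairness_passed else score * 0.5
```

Under this sketch, three perfect components combined with `sampleAdequacy = 0.5` (5 runs) cannot score above 0.8, which shows how strongly run count bounds the result.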

### Decision stability

`comparison.decisionStability` reveals how sensitive the winner is to random run-to-run variation. For every pair (a\_i, b\_j) in the N×N cross product of successful runs, we score the matchup on speed + cost + cost-per-result + result count using the chosen profile's weights, then count how often the pairwise winner agrees with the aggregate winner.

| Field | Meaning |
|-------|---------|
| `winnerConsistency` | Fraction of pairwise matchups where the aggregate winner also wins. 1.0 = deterministic, 0.5 = coin flip. |
| `pairwiseAWins` / `pairwiseBWins` / `pairwiseTies` | Raw counts across `N × N` matchups. |
| `flipRisk` | `low` (consistency ≥0.8) / `medium` (≥0.6) / `high` (<0.6). `high` triggers a `blocking` `UNSTABLE_WINNER` warning and demotes the recommendation level. |

Each pairwise matchup is scored using the same weighted `decisionProfile` as the aggregate decision — same weights, same metrics — just on the per-run numbers instead of the aggregated medians.

If `flipRisk: high` fires on your result, the "winner" is essentially noise. Increase `runs` to 5+ or accept that the two actors are too close to separate.
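The pairwise check can be sketched as follows. This is an illustration under stated assumptions, not the actor's code: each metric's weight is awarded in full to the better side of a matchup (the README does not say how per-metric differences are converted to scores), the weights are the `balanced` row for the four pairwise metrics, and `matchup` / `decision_stability` are hypothetical names. The `flipRisk` thresholds are the ones documented in the table above.

```python
from itertools import product

# Weights for the four pairwise metrics, copied from the `balanced` profile.
BALANCED = {"speed": 1, "cost": 1, "cost_per_result": 2, "result_count": 2}


def _cost_per_result(run):
    return run["cost"] / run["results"] if run["results"] else float("inf")


def matchup(a, b, weights=BALANCED):
    """Return 'A', 'B', or 'tie' for one (a_i, b_j) pair of successful runs."""
    # (value for A, value for B, weight, lower_is_better)
    metrics = [
        (a["duration"], b["duration"], weights["speed"], True),
        (a["cost"], b["cost"], weights["cost"], True),
        (_cost_per_result(a), _cost_per_result(b), weights["cost_per_result"], True),
        (a["results"], b["results"], weights["result_count"], False),
    ]
    score_a = score_b = 0
    for value_a, value_b, weight, lower_is_better in metrics:
        if value_a == value_b:
            continue  # equal values award no points to either side
        a_better = value_a < value_b if lower_is_better else value_a > value_b
        if a_better:
            score_a += weight
        else:
            score_b += weight
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


def decision_stability(a_runs, b_runs, aggregate_winner, weights=BALANCED):
    """Score every pair in the cross product and compare with the aggregate winner."""
    outcomes = [matchup(a, b, weights) for a, b in product(a_runs, b_runs)]
    total = len(outcomes)
    consistency = outcomes.count(aggregate_winner) / total if total else 0.0
    if consistency >= 0.8:
        flip_risk = "low"
    elif consistency >= 0.6:
        flip_risk = "medium"
    else:
        flip_risk = "high"
    return {
        "winnerConsistency": consistency,
        "pairwiseAWins": outcomes.count("A"),
        "pairwiseBWins": outcomes.count("B"),
        "pairwiseTies": outcomes.count("tie"),
        "pairwiseTotal": total,
        "flipRisk": flip_risk,
    }
```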

### Use case — pair-wise regression detection

Set `compareToLastComparableRun: true` and schedule the same A/B test on a cron. Every run, the actor looks up the previous snapshot for the same `(actorA, actorB, testInput, mode, profile)` tuple, and reports:

- `winnerChanged: boolean` — did the verdict flip since last week?
- `confidenceChangedBy: number` — did the certainty drop?
- `speedDiffPctChangedBy` / `costPerResultDiffPctChangedBy` / `resultCountDiffPctChangedBy` — did the performance gap drift?

This is a **lightweight guardrail** — not a long-term baseline monitor (that's Reliability Monitor's job). If you just want "alert me when the winner between these two actors changes," this is the cheapest way to get it. First run for a pair returns `{found: false}` — not a failure.
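A scheduled consumer might turn the delta into alerts along these lines. This is a sketch only: `regression_alerts` and its `max_confidence_drop` threshold are hypothetical, and it assumes a negative `confidenceChangedBy` means confidence fell since the previous run.

```python
def regression_alerts(record, max_confidence_drop=0.1):
    """Return alert strings derived from `sinceLastComparableRun`."""
    delta = record.get("sinceLastComparableRun") or {}
    if not delta.get("found"):
        return []  # first run for this pair: nothing to compare against
    alerts = []
    if delta.get("winnerChanged"):
        alerts.append("winner changed since last comparable run")
    # Assumption: negative values mean the confidence dropped.
    if delta.get("confidenceChangedBy", 0) <= -max_confidence_drop:
        alerts.append("confidence dropped by %.2f" % -delta["confidenceChangedBy"])
    return alerts
```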

### Store UI walkthrough

1. Go to [Actor A/B Tester](https://apify.com/ryanclinton/actor-ab-tester) on the Apify Store.
2. Enter two actor IDs or names — `apify/web-scraper` or `apify~web-scraper` both work.
3. Paste a `testInput` JSON both actors will accept.
4. Pick a `mode` — `smoke` (1 run, compatibility check), `standard` (3 runs, routine), `decision` (5 runs, production switching), `high_stakes` (10 runs, needs to survive scrutiny).
5. Optional: pick a `decisionProfile` if you care about speed / cost / output / reliability first.
6. Click **Start**. Read `headline` + `verdictHuman` for the one-line answer. Read `comparison.warnings[]` before acting.

### Input parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `actorA` | string | Yes | `apify/web-scraper` | Actor ID or name for the first side |
| `actorB` | string | Yes | `apify/cheerio-scraper` | Actor ID or name for the second side |
| `testInput` | object | Yes | `{startUrls:[{url:"https://example.com"}]}` | Passed identically to both actors |
| `mode` | enum | No | `standard` | `smoke` (1 run, capped at monitor) / `standard` (3) / `decision` (5) / `high_stakes` (10) |
| `decisionProfile` | enum | No | `balanced` | `balanced` / `speed_first` / `cost_first` / `output_first` / `reliability_first` |
| `runs` | integer | No | — | Override the mode's run count. If set, wins over `mode`. Range 1–10. |
| `includeStoreContext` | boolean | No | `true` | Fetch each actor's Store popularity stats (informational only) |
| `compareToLastComparableRun` | boolean | No | `false` | Look up the last run for the same pair+input+mode+profile and report delta |
| `timeout` | integer | No | `300` | Max seconds per run (same for both sides) |
| `memory` | integer | No | `512` | Memory MB per run (same for both sides) |
| `apiToken` | string | No | env `APIFY_TOKEN` | Leave blank on your own account — falls back to built-in token |
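The interaction between `mode` and `runs` can be summarized in a few lines. This sketch mirrors the table above (`runs` wins over `mode`, range 1–10); `resolve_runs` is a hypothetical helper, and raising `ValueError` on out-of-range values is an assumption about validation, not documented actor behavior.

```python
MODE_RUNS = {"smoke": 1, "standard": 3, "decision": 5, "high_stakes": 10}


def resolve_runs(mode="standard", runs=None):
    """Return the number of runs per actor for a given input."""
    if runs is not None:
        # An explicit `runs` value overrides the mode's run count.
        if not 1 <= runs <= 10:
            raise ValueError("runs must be between 1 and 10")
        return runs
    if mode not in MODE_RUNS:
        raise ValueError("unknown mode: %r" % (mode,))
    return MODE_RUNS[mode]
```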

### Output contract

```json
{
  "recordType": "comparison",
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "decisionPosture": "switch_now",
  "decisionReadiness": "actionable",
  "verdictCode": "ACTOR_B_WIN",
  "actorA": { "...": "per-actor run stats + aggregates" },
  "actorB": { "...": "per-actor run stats + aggregates" },
  "comparison": {
    "winner": "actorB",
    "verdictCode": "ACTOR_B_WIN",
    "verdictMode": "clear-win",
    "verdictHuman": "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence).",
    "decisionPosture": "switch_now",
    "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE", "STABLE_WINNER"],
    "recommendationLevel": "strong",
    "decisionReadiness": "actionable",
    "confidence": 0.82,
    "confidenceLevel": "high",
    "confidenceBreakdown": {
      "successReliability": 1.0,
      "scoreSeparation": 0.65,
      "variancePenalty": 0.92,
      "sampleAdequacy": 0.5,
      "fairnessChecksPassed": true
    },
    "materiality": {
      "speed": "decisive",
      "cost": "strong",
      "costPerResult": "strong",
      "resultCount": "negligible",
      "fieldCoverage": "material"
    },
    "decisionStability": {
      "winnerConsistency": 0.96,
      "pairwiseAWins": 1,
      "pairwiseBWins": 24,
      "pairwiseTies": 0,
      "pairwiseTotal": 25,
      "flipRisk": "low"
    },
    "reasons": [
      { "metric": "Speed (median)", "winner": "B", "diffPct": 48, "detail": "B median: 4.8s (p90 5.1s), A median: 9.2s (p90 9.8s)", "materiality": "decisive" },
      { "metric": "Cost per result", "winner": "B", "diffPct": 46, "detail": "B: $0.00000498/result, A: $0.00000918/result", "materiality": "strong" }
    ],
    "warnings": [],
    "sharedFields": ["doi", "title"],
    "uniqueToA": ["abstract", "bibtex", "citationCount"],
    "uniqueToB": ["pmcid", "pmid", "meshTerms"],
    "speedDiffPct": 92,
    "costDiffPct": 84,
    "costPerResultDiffPct": 84,
    "resultCountDiffPct": 0
  },
  "comparisonContext": {
    "inputHash": "sha256:3a7b...",
    "normalizedActorA": "ryanclinton~crossref-paper-search",
    "normalizedActorB": "ryanclinton~europe-pmc-search",
    "runsRequested": 5,
    "mode": "decision",
    "decisionProfile": "balanced",
    "timeoutSec": 300,
    "memoryMb": 512,
    "testStartedAt": "2026-04-22T01:55:00.000Z",
    "fairnessChecks": {
      "sameInput": true,
      "sameMemory": true,
      "sameTimeout": true,
      "parallelLaunch": true,
      "childRunStartSpreadSec": 1.2
    },
    "usedStoreSignalsInWinnerSelection": false,
    "comparisonKey": "ab-last-3a7b..."
  },
  "context": {
    "storeSignals": {
      "actorA": { "stats": { "totalUsers": 145 }, "stars": 4.8, "categories": ["DEVELOPER_TOOLS"] },
      "actorB": { "stats": { "totalUsers": 92 }, "stars": 4.6, "categories": ["DEVELOPER_TOOLS"] },
      "usedInWinnerSelection": false,
      "note": "Store popularity is informational context only. It does not influence the winner score under any profile."
    }
  },
  "sinceLastComparableRun": { "found": false },
  "runsPerActor": 5,
  "testedAt": "2026-04-22T01:55:00.000Z"
}
```

#### Top-level fields

| Field | Type | Description |
|-------|------|-------------|
| `recordType` | string | `"comparison"` on success, `"error"` on failure |
| `headline` | string | One-line summary, paste-ready |
| **`decisionPosture`** | **enum** | **`switch_now` / `canary_recommended` / `monitor_only` / `no_call` — the canonical automation filter, duplicated from `comparison.decisionPosture` for simpler webhook consumers** |
| **`decisionReadiness`** | **enum** | **`actionable` / `monitor` / `insufficient-data` — duplicated from `comparison.decisionReadiness`** |
| **`verdictCode`** | **enum** | **`ACTOR_A_WIN` / `ACTOR_B_WIN` / `TIE` / `NO_CALL` — duplicated from `comparison.verdictCode`** |
| `runsPerActor` | number | Runs executed per actor |
| `testedAt` | string | ISO 8601 timestamp |
| `sinceLastComparableRun` | object | Delta vs last comparable run (only populated if `compareToLastComparableRun: true`) |

#### `actorA` / `actorB`

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Actor ID / name as provided |
| `runs` | array | Per-run stats: `{status, results, duration, cost, error?}` |
| `successfulRuns` / `failedRuns` | number | Counts |
| `durationStats`, `costStats`, `resultCountStats` | object | `{mean, median, p90, stddev, min, max}` |
| `costPerResult` | number or null | `costStats.mean / resultCountStats.mean` (the efficiency metric) |
| `fields` | array | Unique field names across all successful runs |
| `fieldNullRates` | array | Per-field null rate, sorted by highest null % |
| `sampleRecord` | object or null | First record from the first successful run |

#### `comparison` (the decision layer)

| Field | Type | Description |
|-------|------|-------------|
| `winner` | enum | `actorA` / `actorB` / `tie` / `no_call` |
| `verdictCode` | enum | `ACTOR_A_WIN` / `ACTOR_B_WIN` / `TIE` / `NO_CALL` — stable machine-readable code |
| `verdictMode` | enum | `clear-win` / `edge` / `tie` / `abstain` — verdict shape |
| `verdictHuman` | string | One-line recommendation sentence — wording aligned with `decisionPosture` |
| `decisionPosture` | enum | **`switch_now` / `canary_recommended` / `monitor_only` / `no_call`** — the one field downstream automation should act on |
| `decisionReasonCodes` | array | Stable codes: `SPEED_EDGE`, `CPR_EDGE`, `LOW_VARIANCE`, `HIGH_CONFIDENCE`, `STABLE_WINNER`, `UNSTABLE_WINNER`, `MONITOR_ROLLOUT_SUGGESTED`, `INSUFFICIENT_DATA` |
| `recommendationLevel` | enum | `strong` / `moderate` / `weak` / `tie` / `no_call` |
| `decisionReadiness` | enum | `actionable` / `monitor` / `insufficient-data` |
| `confidence` | number | 0–1; harmonic mean of reliability, separation, variance, and sample adequacy, halved if fairness fails |
| `confidenceLevel` | enum | `high` (≥0.8) / `medium` (≥0.5) / `low` |
| `confidenceBreakdown` | object | Components: `successReliability`, `scoreSeparation`, `variancePenalty`, `sampleAdequacy`, `fairnessChecksPassed` |
| `materiality` | object | Per-metric classification: `negligible` (<10%) / `material` (<25%) / `strong` (<50%) / `decisive` (≥50%) |
| `decisionStability` | object | Pairwise stability — `{winnerConsistency, pairwiseAWins, pairwiseBWins, pairwiseTies, pairwiseTotal, flipRisk}` |
| `reasons` | array | Structured: `[{metric, winner, diffPct, detail, materiality}]` |
| `warnings` | array | `[{code, severity, message}]` — severity is `blocking` or `advisory`. Any `blocking` warning forbids `actionable` readiness |
| `sharedFields` / `uniqueToA` / `uniqueToB` | array | Output schema overlap |
| `speedDiffPct`, `costDiffPct`, `costPerResultDiffPct`, `resultCountDiffPct` | number | Percentage diffs A vs B (medians) |
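The `materiality` tiers in the table above are plain thresholds on a percentage difference. A minimal sketch, assuming the magnitude of the difference is what gets classified (the README does not state which percentage base the actor feeds in); `materiality_tier` is a hypothetical name.

```python
def materiality_tier(diff_pct):
    """Map a percentage difference to its documented materiality tier."""
    magnitude = abs(diff_pct)
    if magnitude < 10:
        return "negligible"
    if magnitude < 25:
        return "material"
    if magnitude < 50:
        return "strong"
    return "decisive"
```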

#### `comparisonContext` (fairness provenance)

| Field | Description |
|-------|-------------|
| `inputHash` | SHA-256 of the (stable-serialized) testInput — proves both sides ran on identical input |
| `normalizedActorA`, `normalizedActorB` | `username~name` canonical form |
| `runsRequested`, `mode`, `decisionProfile`, `timeoutSec`, `memoryMb` | Test conditions |
| `testStartedAt` | Start timestamp |
| `fairnessChecks` | `{sameInput, sameMemory, sameTimeout, parallelLaunch, childRunStartSpreadSec}` |
| `usedStoreSignalsInWinnerSelection` | Always `false` — quarantine of popularity context |
| `comparisonKey` | Stable KV key for delta lookups |

#### `context.storeSignals`

Informational context for human reviewers: monthly users, star rating, categories. **Never** contributes to the winner score. Set `includeStoreContext: false` to skip the two extra API calls.

### How it works — fairness setup

Both actors receive the exact same `testInput` (hashed to `inputHash` so the test is auditable after the fact), the same `timeout`, and the same `memory`. Both sets of N runs are launched in parallel. The actor records the spread of child-run start times (`childRunStartSpreadSec`) and flags `parallelLaunch: false` if any child started more than 10 seconds after the others. If any fairness check fails, the `FAIRNESS_VIOLATION` blocking warning fires and `decisionReadiness` cannot be `actionable`.
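Both fairness signals are easy to reproduce on the consumer side. The sketch below assumes "stable-serialized" means JSON with sorted keys and compact separators, which may not match the actor's exact serialization, so hashes computed this way are not guaranteed to equal the reported `inputHash`. The function names are hypothetical; the 10-second window is the one described above.

```python
import hashlib
import json


def input_hash(test_input):
    """SHA-256 over a key-order-independent JSON serialization."""
    canonical = json.dumps(test_input, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def launch_fairness(start_times_sec, max_spread_sec=10.0):
    """Check that all child runs started within the allowed window."""
    spread = max(start_times_sec) - min(start_times_sec)
    return {
        "parallelLaunch": spread <= max_spread_sec,
        "childRunStartSpreadSec": round(spread, 1),
    }
```

Sorting keys before hashing means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same hash, which is the property an auditable input fingerprint needs.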

### How it works — run orchestration

Each child run is started via `POST /v2/acts/{id}/runs?waitForFinish={timeout}`. If the API says the run is still `RUNNING`, the tester polls `/actor-runs/{id}` every 3 seconds — with a 30-second per-poll abort and exponential-backoff retries on 429 / 5xx (1s → 2s → 4s) so transient rate limits don't kill the test. Dataset items are fetched with `limit=1000`. Sub-actor credits bill against your account.
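The retry behaviour described above follows a common pattern. Here is a minimal sketch of it, with the HTTP call abstracted behind a hypothetical zero-argument `fetch` callable that returns `(status_code, body)`; the delays are the 1s, 2s, 4s sequence from the text, and `sleep` is injectable so the logic can be tested without waiting.

```python
import time


def fetch_with_backoff(fetch, delays=(1, 2, 4), sleep=time.sleep):
    """Retry `fetch` on 429 / 5xx responses, sleeping 1s, 2s, then 4s.

    Returns the first non-retryable response, or the last response once
    the retries are exhausted.
    """
    status, body = fetch()
    for delay in delays:
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            break
        sleep(delay)
        status, body = fetch()
    return status, body
```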

### How it works — aggregation

Duration and cost stats are computed over **successful runs only** — one failed run doesn't poison the median. Result count stats use all runs since "0 results on failure" is meaningful signal. For each metric we report `{mean, median, p90, stddev, min, max}`. Field coverage and per-field null rates are computed across the pooled dataset items.
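The split between "successful runs only" and "all runs" can be sketched as below. Several details are assumptions: the per-run `status` is taken to be Apify's `SUCCEEDED` string, `p90` uses the nearest-rank method, and `stddev` is the population standard deviation. The README does not specify any of these, and `summarize` / `aggregate` are hypothetical names.

```python
import statistics


def summarize(values):
    """Return {mean, median, p90, stddev, min, max}, or None for no data."""
    if not values:
        return None
    ordered = sorted(values)
    # Nearest-rank p90: index ceil(0.9 * n) - 1, computed with integer math.
    p90_index = (9 * len(ordered) + 9) // 10 - 1
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[p90_index],
        "stddev": statistics.pstdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


def aggregate(runs):
    """Aggregate per-run stats of the form {status, results, duration, cost}."""
    ok = [r for r in runs if r["status"] == "SUCCEEDED"]
    return {
        "successfulRuns": len(ok),
        "failedRuns": len(runs) - len(ok),
        # Duration and cost: successful runs only.
        "durationStats": summarize([r["duration"] for r in ok]),
        "costStats": summarize([r["cost"] for r in ok]),
        # Result count: all runs, so a failure counts as 0 results.
        "resultCountStats": summarize([r["results"] for r in runs]),
    }
```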

### How it works — decision logic

A weighted score is accumulated across seven metrics based on the selected `decisionProfile`:

| Profile | Success | Count | Speed | Cost | $/Result | Fields | Null |
|---------|:-------:|:-----:|:-----:|:----:|:--------:|:------:|:----:|
| `balanced` | 3 | 2 | 1 | 1 | 2 | 1 | 1 |
| `speed_first` | 3 | 1 | 3 | 1 | 2 | 1 | 1 |
| `cost_first` | 3 | 1 | 1 | 2 | 3 | 1 | 1 |
| `output_first` | 3 | 3 | 1 | 1 | 1 | 2 | 2 |
| `reliability_first` | 5 | 2 | 1 | 1 | 1 | 1 | 1 |

The score gap (`|aScore - bScore| / total`) is the primary input to `confidenceBreakdown.scoreSeparation`. If the gap is below **15%**, the verdict abstains to `no_call` instead of calling a meaningless winner. A winner is `strong` only if the gap is ≥35% AND confidence is ≥0.7 AND at least one metric is `decisive` AND pairwise `flipRisk` is not `high`.
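The weights and thresholds above can be combined into a small worked example. This is a sketch under assumptions: each metric's full weight goes to whichever side wins it, `total` is taken to be `aScore + bScore`, and the split between `moderate` and `weak` is not modeled because the text does not define it. The weights are copied from the profile table; the function names are hypothetical.

```python
PROFILES = {
    "balanced":          {"success": 3, "count": 2, "speed": 1, "cost": 1, "costPerResult": 2, "fields": 1, "null": 1},
    "speed_first":       {"success": 3, "count": 1, "speed": 3, "cost": 1, "costPerResult": 2, "fields": 1, "null": 1},
    "cost_first":        {"success": 3, "count": 1, "speed": 1, "cost": 2, "costPerResult": 3, "fields": 1, "null": 1},
    "output_first":      {"success": 3, "count": 3, "speed": 1, "cost": 1, "costPerResult": 1, "fields": 2, "null": 2},
    "reliability_first": {"success": 5, "count": 2, "speed": 1, "cost": 1, "costPerResult": 1, "fields": 1, "null": 1},
}


def score_gap(metric_winners, profile="balanced"):
    """metric_winners maps metric name -> 'A', 'B', or None for a tie."""
    weights = PROFILES[profile]
    a_score = sum(w for m, w in weights.items() if metric_winners.get(m) == "A")
    b_score = sum(w for m, w in weights.items() if metric_winners.get(m) == "B")
    total = a_score + b_score  # assumption: ties contribute to neither side
    gap = abs(a_score - b_score) / total if total else 0.0
    return a_score, b_score, gap


def recommendation(gap, confidence, has_decisive_metric, flip_risk):
    """Apply the documented abstention and `strong` thresholds."""
    if gap < 0.15:
        return "no_call"
    if gap >= 0.35 and confidence >= 0.7 and has_decisive_metric and flip_risk != "high":
        return "strong"
    return "moderate_or_weak"  # the moderate/weak split is not specified here
```

For instance, if B wins speed, cost, and cost per result while A wins field coverage under `balanced`, the scores are 1 versus 4 and the gap is 0.6.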

### How it works — output delivery

- **Full comparison record** → Apify dataset — use this when humans need diagnostics (per-run stats, confidence breakdown, materiality tiers, pairwise stability, sample records).
- **Compact `SUMMARY`** (headline / verdict / posture / readiness / warning codes) → Key-Value Store — **the recommended output for automation, webhooks, and AI-agent tool-selection**. Machine-readable, <1 KB, structured JSON decision output.
- **Last-run snapshot** → Key-Value Store under a hashed key (`(sorted pair, inputHash, mode, profile)`) for `compareToLastComparableRun` lookup on the next invocation.
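For automation, the compact record can be fetched from the run's default Key-Value Store. A minimal sketch, assuming the record key is literally `SUMMARY` and that `client` is an `apify_client.ApifyClient` instance; `read_summary` is a hypothetical helper and `run` is the run object returned when the actor was called.

```python
def read_summary(client, run):
    """Return the compact SUMMARY object for a finished run, or None."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("SUMMARY")  # assumption: key name is "SUMMARY"
    return record["value"] if record else None
```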

### How much does it cost?

Pay-Per-Event pricing at **$0.15 per A/B test**. Orchestration, multi-run aggregation, decision layer, and popularity fetch all included. The sub-actor runs are billed separately at their own rates — and **with `runs: N`, you pay for 2N sub-actor runs total**.

| Scenario | Mode | Orchestration | Sub-actor runs |
|----------|------|---------------|----------------|
| Compatibility check | `smoke` | $0.15 | 2× actor rate |
| Routine comparison | `standard` | $0.15 | 6× actor rate |
| Production decision | `decision` | $0.15 | 10× actor rate |
| High-stakes evaluation | `high_stakes` | $0.15 | 20× actor rate |
| Weekly scheduled (4/month, standard) | 4× `standard` | $0.60 | 24× actor rate |

The Apify Free plan covers ~30 A/B tests/month (orchestration only).
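The "2N sub-actor runs" rule makes the total cost of one test easy to estimate. In this sketch `estimate_cost` is a hypothetical helper, the per-run rates of the two sub-actors are values you supply from your own observations, and the orchestration fee is left as a parameter so you can plug in the price currently shown on the Store page.

```python
def estimate_cost(runs, rate_a, rate_b, orchestration_fee):
    """Estimated USD for one A/B test: one fee plus N runs of each actor."""
    return orchestration_fee + runs * (rate_a + rate_b)
```

For example, a `decision` test (5 runs) of two actors that cost $0.02 and $0.03 per run adds $0.25 of sub-actor usage on top of the orchestration fee.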

### FAQ for skeptics

**What if the two actors interpret the same input differently?**
Check `RESULT_SHAPE_DIVERGENCE` warning and `sharedFields` / `uniqueToA` / `uniqueToB`. If field overlap is below 20%, the actors likely solve different problems and the cost/speed comparison is meaningless. Inspect `sampleRecord` from each side before trusting the verdict.

**How do I know if the winner is real and not just luck?**
Check `comparison.decisionStability.flipRisk` and `comparison.confidenceBreakdown.variancePenalty`. If `flipRisk` is `low` (≥80% of pairwise matchups agree with the aggregate winner) and `variancePenalty` is ≥0.8 (runs are low-noise), the winner is real. If `flipRisk` is `high` or `variancePenalty` is <0.5, the "winner" is likely noise: increase `runs` to 5+ and re-run. For automation you don't have to check manually, because the actor demotes the recommendation level, fires an `UNSTABLE_WINNER` warning, and refuses `actionable` readiness when stability is poor.

**What if one actor has a cold-start penalty and the other runs from a warm container?**
Bump `runs` to 5+ so the cold-start disappears into the aggregate. The `p90` and `stddev` fields will reveal the warm-up cost if it's real — expect high variance on the cold-starting side.

**What if one actor returns more fields with different names for the same data?**
`uniqueToA` / `uniqueToB` surface this. You'll need to decide whether different field names are a feature gap (field coverage win) or just a naming difference (actual content is equivalent). The tester can't resolve that for you; it's a semantic call.

**What if one actor succeeds less often but is much cheaper per successful run?**
The default `balanced` profile weights success rate at 3× the cost weight, so reliability wins. Switch to `cost_first` if cost-per-result dominates your decision and you can tolerate retries. The verdict is auditable: `decisionProfile` is in `comparisonContext`.

**Can popularity ever outweigh runtime evidence?**
No. `usedStoreSignalsInWinnerSelection: false` is a hard constant. Store popularity is informational context for reviewers only and never enters the score under any profile. Set `includeStoreContext: false` to skip fetching it entirely.

**When should I ignore the winner?**

- Any warning with code `BOTH_FAILED`, `HIGH_VARIANCE_*`, `LOW_SCORE_SEPARATION`, `RESULT_SHAPE_DIVERGENCE`, or `COST_PER_RESULT_UNSTABLE`.
- `verdictCode: NO_CALL`.
- `decisionReadiness: insufficient-data`.
- `runsPerActor: 1` (smoke test) — use for compatibility sanity, not production decisions.
- `confidenceBreakdown.fairnessChecksPassed: false`.
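
The checklist above can be turned into a small guard. The sketch below reads only keys that appear in the `SUMMARY` payload example in this README (`warningCodes`, `verdictCode`, `decisionReadiness`, `runsPerActor`). The `fairnessChecksPassed` check is left out because that field is documented under `confidenceBreakdown` in the full record, not in `SUMMARY`. The function name is hypothetical.

```python
# Warning codes from the checklist; HIGH_VARIANCE_* is matched by prefix below.
IGNORE_CODES = {
    "BOTH_FAILED",
    "LOW_SCORE_SEPARATION",
    "RESULT_SHAPE_DIVERGENCE",
    "COST_PER_RESULT_UNSTABLE",
}


def should_ignore_winner(summary: dict) -> bool:
    """Return True if any condition from the checklist applies to a SUMMARY payload."""
    codes = summary.get("warningCodes", [])
    if any(c in IGNORE_CODES or c.startswith("HIGH_VARIANCE_") for c in codes):
        return True
    if summary.get("verdictCode") == "NO_CALL":
        return True
    if summary.get("decisionReadiness") == "insufficient-data":
        return True
    if summary.get("runsPerActor") == 1:  # smoke test: one run is not a sample
        return True
    return False


clean = {"warningCodes": [], "verdictCode": "ACTOR_B_WIN",
         "decisionReadiness": "actionable", "runsPerActor": 5}
print(should_ignore_winner(clean))  # prints False
```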

**Can a 1-run smoke test ever be action-worthy?**
No. Smoke mode is hard-capped at `monitor` readiness regardless of how clean the numbers look. One run is not a sample.

**How many runs should I use?**

- `smoke` (1) — "does my testInput even work on both actors?"
- `standard` (3) — routine comparison, enough to spot real differences.
- `decision` (5) — production switching, variance gets averaged out.
- `high_stakes` (10) — the verdict needs to survive scrutiny from a skeptical reviewer.

**Does the $0.15 fee include the sub-actor run costs?**
No. The $0.15 covers orchestration + decision layer only. **`runs: N` means 2N sub-actor runs**, each billed at that sub-actor's rate. Budget accordingly.

#### Anti-pattern — don't do this

**Do NOT use this actor to compare actors with different input shapes.** Example:

- Actor A expects `{startUrls: [...]}`
- Actor B expects `{query: "..."}`

Passing one shared `testInput` means one side runs with garbage input. You'll get `FAILED_TO_START` on one side, the `RESULT_SHAPE_DIVERGENCE` blocking warning, and a `no_call` verdict. This isn't a bug: the actor is correctly refusing to pick a winner when the test was unfair at the input layer. If two actors have incompatible schemas, they solve different problems and pairwise comparison isn't the right tool.

**How does `compareToLastComparableRun` work?**
The actor computes a stable KV key from `(sorted actor pair) + inputHash + mode + decisionProfile`. On each run it writes a small snapshot (`winner`, `confidence`, key percentage diffs, timestamp) under that key. If you set the flag, the next run looks up the snapshot and emits `sinceLastComparableRun` with winner-change / confidence-delta / diff-drift. First run for a pair just returns `{found: false}` — not an error.
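
To make the idea of a stable key concrete, here is a minimal sketch of how such a key could be derived. The actor's actual hashing scheme is not documented in this README, so the SHA-256 choice, the `ab-` prefix, and the function name are all assumptions. The point is only that sorting the pair and canonicalizing the input makes the key independent of actor order and JSON key order.

```python
import hashlib
import json


def comparable_run_key(actor_a: str, actor_b: str, test_input: dict,
                       mode: str, decision_profile: str) -> str:
    """Derive a deterministic key from (sorted pair, inputHash, mode, profile)."""
    pair = sorted([actor_a, actor_b])  # A-vs-B and B-vs-A map to the same key
    canonical_input = json.dumps(test_input, sort_keys=True, separators=(",", ":"))
    input_hash = hashlib.sha256(canonical_input.encode("utf-8")).hexdigest()
    material = json.dumps([pair, input_hash, mode, decision_profile])
    # Hypothetical prefix and length; any stable encoding would do.
    return "ab-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


test_input = {"startUrls": [{"url": "https://example.com"}]}
k1 = comparable_run_key("apify/web-scraper", "apify/cheerio-scraper",
                        test_input, "standard", "balanced")
k2 = comparable_run_key("apify/cheerio-scraper", "apify/web-scraper",
                        test_input, "standard", "balanced")
print(k1 == k2)  # prints True
```

Changing any one of the four components (for example `mode`) yields a different key, which is why a `standard` run is never compared against a `decision` run.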

**Why does confidence use the harmonic mean?**
Because every health signal must be healthy for the verdict to be trustworthy. Arithmetic mean would let one strong signal (e.g. 100% success rate) mask a weak one (e.g. 5% score separation). Harmonic mean collapses to ~0 if any component is near zero. Same reason F1 score uses harmonic mean of precision and recall.
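
A quick numeric illustration, using Python's standard library and the two example signals from the answer above (100% success rate and 5% score separation). The actor's real confidence formula may combine more components. This only shows how the two means behave.

```python
from statistics import harmonic_mean, mean

# Two illustrative health signals on a 0..1 scale: one strong, one weak.
signals = [1.0, 0.05]

print(round(mean(signals), 3))           # prints 0.525 (the weak signal is masked)
print(round(harmonic_mean(signals), 3))  # prints 0.095 (the weak signal dominates)
```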

**Is it legal to compare actors from other developers?**
Yes. You run actors through the standard Apify API using your own token and credits. No different from running any public actor on the Store.

### Automation contract

Three integration paths, chosen by your consumer's shape:

| Consumer | Read from | Why |
|----------|-----------|-----|
| **Webhook / Zapier / Slack / CI gate** | Root `decisionPosture` on the dataset record | One field, stable enum, routes directly to action — `switch_now` / `canary_recommended` / `monitor_only` / `no_call`. No need to walk into `comparison.*`. |
| **Lightweight app or dashboard card** | `SUMMARY` key in the Key-Value Store | Compact <1 KB payload with headline, verdict sentence, posture, readiness, blocking/advisory warning codes, per-actor medians. Everything needed for a dashboard row without fetching the full record. |
| **Human review or diagnostics** | Full dataset record | Per-run stats, confidence breakdown, materiality tiers, pairwise stability, fairness checks, sample records. Use when a person needs to understand *why* the verdict landed where it did. |

**Rule of thumb:** automation reads root fields or `SUMMARY`. Humans read the full dataset record. Never parse `verdictHuman` — it's for display, not routing.

### Programmatic access

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")

run = client.actor("ryanclinton/actor-ab-tester").call(
    run_input={
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision",
        "decisionProfile": "speed_first",
    }
)

result = next(client.dataset(run["defaultDatasetId"]).iterate_items())
posture = result["decisionPosture"]  # root-level field, per the automation contract

# Route by decisionPosture: the canonical action filter
if posture == "switch_now":
    winner = result[result["comparison"]["winner"]]
    print(f"→ SWITCH production to {winner['name']}")
elif posture == "canary_recommended":
    winner = result[result["comparison"]["winner"]]
    print(f"→ CANARY {winner['name']} before full rollout")
elif posture == "monitor_only":
    print(f"→ MONITOR — directional edge, do not auto-switch")
else:  # no_call
    print(f"→ NO CALL — insufficient evidence")

print(result["comparison"]["verdictHuman"])
for w in result["comparison"]["warnings"]:
    print(f"  [{w['severity'].upper()}] [{w['code']}] {w['message']}")
```

#### JavaScript

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx" });

const run = await client.actor("ryanclinton/actor-ab-tester").call({
    actorA: "apify/web-scraper",
    actorB: "apify/cheerio-scraper",
    testInput: { startUrls: [{ url: "https://example.com" }] },
    mode: "decision",
    decisionProfile: "balanced",
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
const result = items[0];
const posture = result.decisionPosture; // root-level field, per the automation contract

// Route by decisionPosture — the canonical action filter
switch (posture) {
    case "switch_now":
        console.log(`→ SWITCH production to ${result[result.comparison.winner].name}`);
        break;
    case "canary_recommended":
        console.log(`→ CANARY ${result[result.comparison.winner].name} before full rollout`);
        break;
    case "monitor_only":
        console.log(`→ MONITOR — directional edge, do not auto-switch`);
        break;
    case "no_call":
        console.log(`→ NO CALL — insufficient evidence`);
}

console.log(result.comparison.verdictHuman);
result.comparison.warnings.forEach((w) =>
    console.log(`  [${w.severity.toUpperCase()}] [${w.code}] ${w.message}`),
);
```

#### Webhook / automation payload — the one thing to integrate

> **If you only integrate one output, use the `SUMMARY` KV payload.** This is the recommended output for automation, webhooks, and AI agents. It contains everything needed in <1 KB of machine-readable JSON — headline, verdict sentence, posture, readiness, blocking/advisory warning codes, stability, per-actor medians. Stable keys, documented enums, no prose parsing required.

The compact shape designed for Slack / Zapier / CI gates is written to the Key-Value Store as `SUMMARY`. Read it with:

```bash
curl "https://api.apify.com/v2/key-value-stores/$KV_STORE_ID/records/SUMMARY?token=YOUR_API_TOKEN"
```

Returns:

```json
{
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "verdictHuman": "Use ryanclinton/europe-pmc-search — decisively faster, material cheaper per result across 5 runs each (high confidence).",
  "verdictCode": "ACTOR_B_WIN",
  "recommendationLevel": "strong",
  "confidenceLevel": "high",
  "decisionReadiness": "actionable",
  "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE"],
  "warningCodes": [],
  "actorA": { "name": "...", "successfulRuns": 5, "medianDurationS": 9.2, "medianCostUsd": 0.00044, "costPerResult": 0.00000918 },
  "actorB": { "name": "...", "successfulRuns": 5, "medianDurationS": 4.8, "medianCostUsd": 0.00025, "costPerResult": 0.00000498 },
  "runsPerActor": 5,
  "mode": "decision",
  "decisionProfile": "balanced",
  "sinceLastComparableRun": { "found": false },
  "testedAt": "2026-04-22T01:55:00.000Z"
}
```

#### cURL — synchronous

```bash
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~actor-ab-tester/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision"
    }'
```

### What this actor does NOT do

This is a narrow tool by design. If you need any of these, use the sibling actor instead:

- **Does NOT score README / SEO / schema / config quality** → use [`actor-quality-monitor`](https://apify.com/ryanclinton/actor-quality-monitor) (metadata scorecard, 8 weighted dimensions, remediation plan).
- **Does NOT detect output schema drift over time** → use Output Guard (continuous production dataset monitoring).
- **Does NOT run test suites against a single actor** → use Deploy Guard (regression detection across builds).
- **Does NOT recommend PPE prices or plan fits** → use Pricing Advisor.
- **Does NOT scan the Store for competitors or niches** → use [`actor-competitor-scanner`](https://apify.com/ryanclinton/actor-competitor-scanner) / Market Gap Finder.
- **Does NOT monitor account-wide spending** → use [`cost-watchdog`](https://apify.com/ryanclinton/cost-watchdog).
- **Does NOT synthesize a portfolio-wide action plan** → use Fleet Analytics.
- **Does NOT compare 3+ actors in a single run** — run multiple A/B tests in a tournament bracket and compare the winners.
- **Does NOT maintain a long-term baseline** — use Reliability Monitor for that. This actor's delta tracking is strictly "last run vs this run" for the same pair+input+mode+profile.
- **Does NOT audit PII / GDPR / TOS** → use Compliance Scanner.
- **Is NOT a load tester** — use k6 / Apache Bench / wrk. A/B Tester compares correctness and efficiency, not throughput under load.
- **Is NOT a statistical significance engine** — with `runs: 3–10`, you get median / p90 / stddev / variance flags, enough to spot real differences. Rigorous p-values would need `runs: 30+` and a different tool.
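
The tournament-bracket approach mentioned above for 3+ actors could look like the sketch below. It is a plain single-elimination loop. `compare` is a hypothetical callable that you would implement as a wrapper around one A/B test run, returning the name of the actor to advance. It also has to decide what to do on a `no_call` result, for example by keeping the first contender.

```python
from typing import Callable, List


def run_bracket(actors: List[str], compare: Callable[[str, str], str]) -> str:
    """Single-elimination bracket over actor names; each match is one A/B test."""
    if not actors:
        raise ValueError("need at least one actor")
    contenders = list(actors)
    while len(contenders) > 1:
        next_round = []
        # Pair contenders off two at a time.
        for i in range(0, len(contenders) - 1, 2):
            next_round.append(compare(contenders[i], contenders[i + 1]))
        if len(contenders) % 2 == 1:
            next_round.append(contenders[-1])  # odd one out gets a bye
        contenders = next_round
    return contenders[0]


# Stand-in comparator for demonstration only: picks the alphabetically first name.
print(run_bracket(["c", "a", "b"], lambda x, y: min(x, y)))  # prints a
```

A bracket with k actors needs k - 1 matches, so budget k - 1 orchestration fees plus the sub-actor runs for each match.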

### Related actors

- [`actor-quality-monitor`](https://apify.com/ryanclinton/actor-quality-monitor) — 8-dimension metadata quality scorecard for a single actor
- [`actor-competitor-scanner`](https://apify.com/ryanclinton/actor-competitor-scanner) — find competing actors in the Store
- [`cost-watchdog`](https://apify.com/ryanclinton/cost-watchdog) — account-wide spending and cost-anomaly monitoring

***

## Appendix: Explanation and background (optional)

This appendix is not required for integration. The sections below expand on the contract above: use cases, AI-agent integration detail, and an explanation of the core concept. Skip it if you already have the contract and are integrating.

### When to use this actor

**Use it when:**

- You're choosing between two Apify actors for a production workflow
- You need a data-driven, repeatable decision before switching actors
- You want an automation-safe verdict (CI gate, Slack alert, Zapier routing, agent tool-selection)
- You want to detect regressions in a weekly scheduled comparison of the same pair

**Do NOT use it when:**

- You need to compare 3+ actors in a single run (run a tournament bracket instead)
- You need long-term health monitoring of one actor (use Reliability Monitor)
- You want to score an actor's Store listing quality (use Quality Monitor)
- You want to load-test an actor under concurrency (use k6 / Apache Bench)

### For AI agents and automated tool-selection

Actor A/B Tester is designed for AI-agent tool selection. Agents, LLM tool-callers, and automated pipelines call it to choose between two Apify actors based on measured speed, cost, and reliability rather than static heuristics or guessing. They branch on the `decisionPosture` field without interpreting raw performance metrics or writing custom scoring logic.

**Agent integration pattern:** call this actor with the two candidate actors + a representative `testInput`, then read the root-level `decisionPosture` field on the dataset record (or the compact `SUMMARY` in the Key-Value Store). Branch your agent logic on the four stable enum values:

- `switch_now` → commit to the winner, log the verdict
- `canary_recommended` → route a fraction of traffic to the winner, monitor
- `monitor_only` → log the directional result but don't change routing yet
- `no_call` → keep the current actor, re-run later with more data

Because the output is structured JSON with documented enums and stable field names, agents can route without parsing prose. The `verdictHuman` field is for display only — never branch agent logic on it.
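
The branching described above might be sketched as follows. The four posture values come from this README. The plan dictionary, the `canary_share` knob, its 10% default, and the function name are hypothetical choices for illustration. Unknown values fall through to "keep the current actor" so that a future enum value cannot trigger a switch by accident.

```python
def plan_for_posture(posture: str, current_actor: str, winner: str) -> dict:
    """Map a decisionPosture value to a routing plan (illustrative shape)."""
    if posture == "switch_now":
        return {"primary": winner, "canary": None, "canary_share": 0.0}
    if posture == "canary_recommended":
        # Send a small, hypothetical fraction of traffic to the winner first.
        return {"primary": current_actor, "canary": winner, "canary_share": 0.1}
    # monitor_only, no_call, and any unknown value: keep the current actor.
    return {"primary": current_actor, "canary": None, "canary_share": 0.0}


plan = plan_for_posture("canary_recommended", "apify/web-scraper", "apify/cheerio-scraper")
print(plan["primary"], plan["canary"], plan["canary_share"])
```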

### Core concept

An A/B test runs two actors N times each in parallel on **identical input**, aggregates duration / cost / result count with statistical measures (median, p90, stddev), and emits a **deterministic decision** — winner, confidence tier, readiness level, posture — based on weighted scoring, materiality thresholds, and pairwise stability. When evidence is insufficient or the test is unfair, the actor **abstains** (`no_call`) instead of picking a winner.
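
As a rough illustration of the aggregation step, the sketch below computes the three statistics named above for one actor's run durations. The durations are made-up numbers. The nearest-rank percentile method and the use of the sample standard deviation are assumptions, since this README does not state which variants the actor uses.

```python
import math
from statistics import median, stdev


def summarize(values):
    """Return median, nearest-rank p90, and sample stddev for a list of numbers."""
    ordered = sorted(values)
    # Nearest-rank p90: the smallest value with at least 90% of samples at or below it.
    rank = math.ceil(0.9 * len(ordered))
    return {
        "median": median(ordered),
        "p90": ordered[rank - 1],
        "stddev": stdev(ordered) if len(ordered) > 1 else 0.0,
    }


# Five hypothetical run durations in seconds; the last one includes a cold start.
stats = summarize([4.6, 4.8, 4.7, 5.1, 9.3])
print(stats["median"], stats["p90"])  # prints 4.8 9.3
```

The median is barely moved by the slow run while `p90` and `stddev` expose it, which is why the FAQ suggests looking at those two fields for cold-start effects.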

# Actor input Schema

## `actorA` (type: `string`):

Actor ID or name for the first actor to test (e.g. 'apify/web-scraper' or the opaque actor ID)

## `actorB` (type: `string`):

Actor ID or name for the second actor to test (e.g. 'apify/cheerio-scraper' or the opaque actor ID)

## `testInput` (type: `object`):

JSON input passed identically to both actors. Must be compatible with both actors' input schemas.

## `mode` (type: `string`):

Preset that maps to a runs-per-actor count and a readiness ceiling. 'smoke' (1 run) is capped at 'monitor' readiness and can never return 'actionable'. 'standard' (3) is the sensible default for routine comparison. 'decision' (5) is suitable for production switching. 'high\_stakes' (10) is for decisions where the verdict needs to survive scrutiny.

## `decisionProfile` (type: `string`):

How the winner is weighted. 'balanced' spreads weight across all metrics. 'speed\_first' / 'cost\_first' / 'output\_first' / 'reliability\_first' upweight one dimension. The chosen profile is reported alongside the verdict so the result stays auditable.

## `runs` (type: `integer`):

Override the runs count set by the mode preset. If set, this wins over mode. Range 1–10. More runs = less noise, N× the Apify platform bill.

## `includeStoreContext` (type: `boolean`):

Fetch each actor's Apify Store stats (monthly users, star rating, categories) and attach to the result as informational context. Store signals NEVER affect the winner score — they are reported separately so reviewers can weigh trust context without contaminating the comparator.

## `compareToLastComparableRun` (type: `boolean`):

Look up the previous run for the same pair + same testInput + same mode + same profile, and report delta (winner change, confidence delta, metric drift). First run of a pair emits 'found: false' — not a failure.

## `timeout` (type: `integer`):

Maximum time in seconds to wait for each child actor run to complete. Both actors run their N runs in parallel.

## `memory` (type: `integer`):

Memory allocation in megabytes for each child run. Higher memory = faster execution but higher sub-actor cost.

## `apiToken` (type: `string`):

Your Apify API token, used to start and poll the two child actors. Leave blank when running on your own account — the actor falls back to the built-in APIFY\_TOKEN. Required when the tester needs to start third-party actors outside the runner's scope. Find it at https://console.apify.com/settings/integrations

## Actor input object example

```json
{
  "actorA": "apify/web-scraper",
  "actorB": "apify/cheerio-scraper",
  "testInput": {
    "startUrls": [
      {
        "url": "https://example.com"
      }
    ]
  },
  "mode": "standard",
  "decisionProfile": "balanced",
  "includeStoreContext": true,
  "compareToLastComparableRun": false,
  "timeout": 300,
  "memory": 512
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "actorA": "apify/web-scraper",
    "actorB": "apify/cheerio-scraper",
    "testInput": {
        "startUrls": [
            {
                "url": "https://example.com"
            }
        ]
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("ryanclinton/actor-ab-tester").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "actorA": "apify/web-scraper",
    "actorB": "apify/cheerio-scraper",
    "testInput": { "startUrls": [{ "url": "https://example.com" }] },
}

# Run the Actor and wait for it to finish
run = client.actor("ryanclinton/actor-ab-tester").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "actorA": "apify/web-scraper",
  "actorB": "apify/cheerio-scraper",
  "testInput": {
    "startUrls": [
      {
        "url": "https://example.com"
      }
    ]
  }
}' |
apify call ryanclinton/actor-ab-tester --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ryanclinton/actor-ab-tester",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Actor A/B Tester — Compare Two Actors Side by Side",
        "description": "Run two Apify actors with identical input in parallel and compare results side by side. Measures result count, field coverage, execution speed, and compute cost. Declares a winner with percentage diffs. Returns JSON/CSV/Excel.",
        "version": "1.0",
        "x-build-id": "fd1ROFgZTpPMNhoaH"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ryanclinton~actor-ab-tester/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ryanclinton-actor-ab-tester",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ryanclinton~actor-ab-tester/runs": {
            "post": {
                "operationId": "runs-sync-ryanclinton-actor-ab-tester",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ryanclinton~actor-ab-tester/run-sync": {
            "post": {
                "operationId": "run-sync-ryanclinton-actor-ab-tester",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "actorA",
                    "actorB",
                    "testInput"
                ],
                "properties": {
                    "actorA": {
                        "title": "Actor A",
                        "type": "string",
                        "description": "Actor ID or name for the first actor to test (e.g. 'apify/web-scraper' or the opaque actor ID)",
                        "default": "apify/web-scraper"
                    },
                    "actorB": {
                        "title": "Actor B",
                        "type": "string",
                        "description": "Actor ID or name for the second actor to test (e.g. 'apify/cheerio-scraper' or the opaque actor ID)",
                        "default": "apify/cheerio-scraper"
                    },
                    "testInput": {
                        "title": "Test Input",
                        "type": "object",
                        "description": "JSON input passed identically to both actors. Must be compatible with both actors' input schemas.",
                        "default": {
                            "startUrls": [
                                {
                                    "url": "https://example.com"
                                }
                            ]
                        }
                    },
                    "mode": {
                        "title": "Test mode",
                        "enum": [
                            "smoke",
                            "standard",
                            "decision",
                            "high_stakes"
                        ],
                        "type": "string",
                        "description": "Preset that maps to a runs-per-actor count and a readiness ceiling. 'smoke' (1 run) is capped at 'monitor' readiness and can never return 'actionable'. 'standard' (3) is the sensible default for routine comparison. 'decision' (5) is suitable for production switching. 'high_stakes' (10) is for decisions where the verdict needs to survive scrutiny.",
                        "default": "standard"
                    },
                    "decisionProfile": {
                        "title": "Decision profile",
                        "enum": [
                            "balanced",
                            "speed_first",
                            "cost_first",
                            "output_first",
                            "reliability_first"
                        ],
                        "type": "string",
                        "description": "How the winner is weighted. 'balanced' spreads weight across all metrics. 'speed_first' / 'cost_first' / 'output_first' / 'reliability_first' upweight one dimension. The chosen profile is reported alongside the verdict so the result stays auditable.",
                        "default": "balanced"
                    },
                    "runs": {
                        "title": "Runs per actor (override)",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Override the runs count set by the mode preset. If set, this wins over mode. Range 1–10. More runs = less noise, N× the Apify platform bill."
                    },
                    "includeStoreContext": {
                        "title": "Include Store popularity context",
                        "type": "boolean",
                        "description": "Fetch each actor's Apify Store stats (monthly users, star rating, categories) and attach to the result as informational context. Store signals NEVER affect the winner score — they are reported separately so reviewers can weigh trust context without contaminating the comparator.",
                        "default": true
                    },
                    "compareToLastComparableRun": {
                        "title": "Compare to last comparable run",
                        "type": "boolean",
                        "description": "Look up the previous run for the same pair + same testInput + same mode + same profile, and report delta (winner change, confidence delta, metric drift). First run of a pair emits 'found: false' — not a failure.",
                        "default": false
                    },
                    "timeout": {
                        "title": "Timeout per run (seconds)",
                        "minimum": 10,
                        "maximum": 3600,
                        "type": "integer",
                        "description": "Maximum time in seconds to wait for each child actor run to complete. Both actors run their N runs in parallel.",
                        "default": 300
                    },
                    "memory": {
                        "title": "Memory per run (MB)",
                        "minimum": 128,
                        "maximum": 32768,
                        "type": "integer",
                        "description": "Memory allocation in megabytes for each child run. Higher memory = faster execution but higher sub-actor cost.",
                        "default": 512
                    },
                    "apiToken": {
                        "title": "Apify API Token (optional)",
                        "type": "string",
                        "description": "Your Apify API token, used to start and poll the two child actors. Leave blank when running on your own account — the actor falls back to the built-in APIFY_TOKEN. Required when the tester needs to start third-party actors outside the runner's scope. Find it at https://console.apify.com/settings/integrations"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
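
The `runsResponseSchema` above describes the run object returned when a run is started through the API. As a minimal sketch of how those fields could be used to compare two runs, the Python snippet below reads `startedAt`, `finishedAt`, `usageTotalUsd`, and `stats.computeUnits` from two response bodies and computes percentage differences. The helper names (`parse_ts`, `summarize_run`, `pct_diff`) are hypothetical and the sample values are made up for illustration. This is not the Actor's own comparison logic, and it assumes both runs have already finished, so `finishedAt` is populated.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    # Schema timestamps look like "2025-01-08T00:00:00.000Z";
    # swap the trailing "Z" for an explicit UTC offset before parsing.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_run(response: dict) -> dict:
    # The run object sits under the top-level "data" key.
    run = response["data"]
    duration = parse_ts(run["finishedAt"]) - parse_ts(run["startedAt"])
    return {
        "status": run["status"],
        "duration_secs": duration.total_seconds(),
        "cost_usd": run.get("usageTotalUsd", 0.0),
        "compute_units": run.get("stats", {}).get("computeUnits", 0),
    }


def pct_diff(a: float, b: float) -> Optional[float]:
    # Percentage change of b relative to a; undefined when a is zero.
    if a == 0:
        return None
    return (b - a) / a * 100.0


# Two hand-written responses shaped like runsResponseSchema (sample data only).
run_a = {"data": {"status": "SUCCEEDED",
                  "startedAt": "2025-01-08T00:00:00.000Z",
                  "finishedAt": "2025-01-08T00:00:10.000Z",
                  "usageTotalUsd": 0.002,
                  "stats": {"computeUnits": 0.004}}}
run_b = {"data": {"status": "SUCCEEDED",
                  "startedAt": "2025-01-08T00:00:00.000Z",
                  "finishedAt": "2025-01-08T00:00:15.000Z",
                  "usageTotalUsd": 0.001,
                  "stats": {"computeUnits": 0.002}}}

summary_a = summarize_run(run_a)
summary_b = summarize_run(run_b)
duration_diff = pct_diff(summary_a["duration_secs"], summary_b["duration_secs"])
cost_diff = pct_diff(summary_a["cost_usd"], summary_b["cost_usd"])
print(f"B vs A duration: {duration_diff:+.1f}%  cost: {cost_diff:+.1f}%")
```

A positive value means run B used more of that resource than run A. Note that `computeUnits` and the USD fields can be fractional in practice, even though the generated schema lists most of them as `integer`.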