Benchmarks

-74.0%

Mean token savings across read, grep, and edit — measured against ashlrai/ashlr-plugin, commit b98da9c, on April 26, 2026.

files measured: 750

lines of code: 149,462

read samples: 16

grep patterns: 5

Per-tool breakdown

[Bar charts: one panel each for ashlr__read, ashlr__grep, and ashlr__edit. In every panel, dark bar = mean  ·  mid = p50  ·  light = p90.]

The ashlr__edit “small” scenario (15-char change) shows a ratio > 1 by design: for tiny changes, the diff header is longer than the before/after text it replaces. Medium and large edits compress well. The small-edit ratio is reported as measured, not dropped.

Read sample scatter

ashlr__read: file size vs. reduction

Each dot is one sampled file. x-axis = raw file size; y-axis = tokens saved. Files below 2 KB are excluded (snipCompact only fires above that threshold).

Methodology

Measurement methodology (version 2):

**ashlr__read**: For each sampled source file, we measure raw file bytes and
token count (chars/4 heuristic). We then apply the same snipCompact
transformation used at runtime — wrapping the content in a tool_result message
and calling snipCompact() — and measure the resulting byte/token count. The
ratio is ashlrTokens / rawTokens. Files below 2 KB are excluded because
snipCompact only fires on tool results > 2 000 chars; savings are zero by
design for small files.
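
A minimal sketch of the ratio calculation described above. The real `snipCompact()` is not reproduced here: `compactStandIn` is a hypothetical placeholder that keeps the head and tail of a long string, and rounding the chars/4 estimate up is also an assumption.

```typescript
// Token estimate: the chars/4 heuristic named in the methodology.
// Rounding up is an assumption of this sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Compaction is skipped at or below this many chars (2 000 in the methodology).
const MIN_CHARS = 2000;

// Hypothetical stand-in for snipCompact(): keeps the first and last `keep`
// chars and replaces the middle with a marker. The real transformation may differ.
function compactStandIn(text: string, keep: number = 800): string {
  if (text.length <= MIN_CHARS) return text;
  const dropped = text.length - 2 * keep;
  return `${text.slice(0, keep)}\n[... ${dropped} chars elided ...]\n${text.slice(-keep)}`;
}

// ratio = ashlrTokens / rawTokens (lower means more savings).
function readRatio(
  raw: string,
  compact: (s: string) => string = compactStandIn,
): number {
  return estimateTokens(compact(raw)) / estimateTokens(raw);
}

const smallFile = "x".repeat(1500);  // below the threshold: left untouched
const largeFile = "x".repeat(20000); // above the threshold: compacted
console.log("small ratio:", readRatio(smallFile));
console.log("large ratio:", readRatio(largeFile).toFixed(3));
```

Passing the compaction function as a parameter keeps the ratio logic independent of whichever transformation is actually in use.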

Files are selected deterministically: the repo HEAD commit SHA is folded into a
32-bit seed (mulberry32 PRNG), then up to 4 files are sampled from each of four
size buckets (2–5 KB, 5–15 KB, 15–50 KB, 50+ KB). Re-running on the same
commit always picks the same files.
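
The selection step can be sketched as follows. `mulberry32` is the standard public-domain generator; everything else is an assumption made for illustration: folding the SHA by XOR-ing 8-hex-digit chunks, treating 1 KB as 1024 bytes, sorting by path, and drawing with a partial Fisher-Yates shuffle. The names `foldSha`, `sampleFiles`, and `FileInfo` are hypothetical.

```typescript
// mulberry32: a small 32-bit PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumption: fold the hex SHA into 32 bits by XOR-ing 8-digit chunks.
function foldSha(sha: string): number {
  let seed = 0;
  for (let i = 0; i < sha.length; i += 8) {
    seed ^= parseInt(sha.slice(i, i + 8), 16);
  }
  return seed >>> 0;
}

interface FileInfo {
  path: string;
  bytes: number;
}

const KB = 1024; // assumption: binary kilobytes
const BUCKETS: Array<[number, number]> = [
  [2 * KB, 5 * KB],
  [5 * KB, 15 * KB],
  [15 * KB, 50 * KB],
  [50 * KB, Infinity],
];

function sampleFiles(files: FileInfo[], sha: string, perBucket: number = 4): FileInfo[] {
  const rand = mulberry32(foldSha(sha));
  const picked: FileInfo[] = [];
  for (const [lo, hi] of BUCKETS) {
    // Sort by path so the order of the input list cannot change the outcome.
    const pool = files
      .filter((f) => f.bytes >= lo && f.bytes < hi)
      .sort((x, y) => (x.path < y.path ? -1 : x.path > y.path ? 1 : 0));
    // Partial Fisher-Yates shuffle: draw up to perBucket files without replacement.
    const n = Math.min(perBucket, pool.length);
    for (let i = 0; i < n; i++) {
      const j = i + Math.floor(rand() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      picked.push(pool[i]);
    }
  }
  return picked;
}

// Synthetic file list for illustration only.
const demoFiles: FileInfo[] = Array.from({ length: 60 }, (_, i) => ({
  path: `src/f${i}.ts`,
  bytes: 1000 + i * 1500,
}));
console.log(sampleFiles(demoFiles, "b98da9c").map((f) => f.path));
```

Because the generator state depends only on the SHA and the sorted file list, repeated runs on the same commit yield the same picks.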

**ashlr__grep**: Five common patterns (function, import, TODO, class, interface)
are run via rg --json against the repo root. Raw output bytes are measured
directly. The ashlr__grep fallback path (no genome) truncates output to 4 000
chars (head 2 000 + tail 1 000). The ratio is truncated/raw.

Note: when a .ashlrcode/genome/ index is present, real-world grep savings are
substantially higher. This benchmark measures only the conservative
no-genome baseline.
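
A rough sketch of the no-genome grep ratio. It treats 4 000 chars as the threshold above which output is cut to a 2 000-char head plus a 1 000-char tail; that reading of the figures is an assumption, so all three numbers are left as parameters. The sketch also counts string length rather than bytes, and `truncateHeadTail` and `grepRatio` are hypothetical names.

```typescript
// Keep the head and tail of long output; pass short output through unchanged.
// limit/head/tail defaults follow the figures in the text, but how they
// combine in the real fallback path is an assumption of this sketch.
function truncateHeadTail(
  raw: string,
  limit: number = 4000,
  head: number = 2000,
  tail: number = 1000,
): string {
  if (raw.length <= limit) return raw;
  return raw.slice(0, head) + raw.slice(raw.length - tail);
}

// ratio = truncated / raw (lower means more savings).
function grepRatio(raw: string): number {
  return truncateHeadTail(raw).length / raw.length;
}

const shortOutput = "match\n".repeat(100);  // 600 chars: unchanged
const longOutput = "match\n".repeat(5000);  // 30 000 chars: truncated
console.log("short ratio:", grepRatio(shortOutput));
console.log("long ratio:", grepRatio(longOutput));
```

Under these assumptions the ratio falls as raw output grows, since the truncated size is capped while the raw size is not.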

**ashlr__edit**: Three synthetic edits (small ~15 chars, medium ~300 chars,
large ~3 000 chars) compare the naive "ship before+after as text" approach
against ashlr__edit's diff-summary format (one header line + removed/added
first-lines). The ratio is summary tokens / naive tokens.
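
The comparison can be illustrated with a small sketch. The summary layout below (a header line followed by the first line of the removed and added text, capped at 80 chars) is a hypothetical format chosen for illustration, not the plugin's actual output, and `summarizeEdit` and `editRatio` are made-up names.

```typescript
// chars/4 token estimate; rounding up is an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical diff-summary: one header line plus removed/added first lines.
function summarizeEdit(path: string, before: string, after: string): string {
  const firstLine = (s: string): string => s.split("\n", 1)[0].slice(0, 80);
  const header = `edit ${path}: -${before.length} +${after.length} chars`;
  return [header, `- ${firstLine(before)}`, `+ ${firstLine(after)}`].join("\n");
}

// ratio = summary tokens / naive tokens, where naive ships before + after as text.
function editRatio(path: string, before: string, after: string): number {
  const naive = estimateTokens(before + after);
  const summary = estimateTokens(summarizeEdit(path, before, after));
  return summary / naive;
}

// Tiny edit: the header alone outweighs the two short strings.
console.log("small:", editRatio("src/a.ts", "const a = 1;", "const a = 2;"));

// Large edit: roughly 3 000 chars on each side collapse to three short lines.
const bigBefore = "let total = 0;\n".repeat(200);
const bigAfter = "let total = 1;\n".repeat(200);
console.log("large:", editRatio("src/a.ts", bigBefore, bigAfter).toFixed(3));
```

The fixed cost of the header explains why the small scenario lands above 1 while larger edits land well below it.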

**Aggregation**: per-tool mean/p50/p90 are computed over each tool's ratio
values (lower ratio = more savings). The `overall.mean` is pooled across
every individual sample regardless of tool, so tools with more samples (read
has 16, grep has 5, edit has 3) weight the overall figure proportionally.
That makes the headline number reflect the workload mix, not a uniform
per-tool average. The unweighted mean of per-tool means is intentionally NOT
published because it gives equal weight to a 3-sample tool and a 16-sample
tool, which over-weights the synthetic edit overhead.
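
To make the difference between the two averages concrete, here is a small sketch. The ratio values are illustrative only and are not measured results; the nearest-rank percentile rule is an assumption, since the interpolation method is not stated.

```typescript
type RatiosByTool = Record<string, number[]>;

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Nearest-rank percentile (assumed rule): smallest value with at least p% of
// the samples at or below it.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Pooled mean: every sample counts once, so larger tools weigh more.
function pooledMean(byTool: RatiosByTool): number {
  const all: number[] = [];
  for (const tool of Object.keys(byTool)) all.push(...byTool[tool]);
  return mean(all);
}

// Unweighted mean of per-tool means: every tool counts once.
function meanOfToolMeans(byTool: RatiosByTool): number {
  return mean(Object.keys(byTool).map((tool) => mean(byTool[tool])));
}

// Illustrative ratios only, not benchmark data.
const demo: RatiosByTool = {
  read: [0.25, 0.25, 0.25, 0.25],
  edit: [1.5],
};
console.log(pooledMean(demo));      // 0.5
console.log(meanOfToolMeans(demo)); // 0.875
```

In this toy data the single high-ratio edit sample moves the unweighted figure far more than the pooled one, which is the effect the paragraph above describes.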

Token counts use the chars/4 heuristic, the same estimator the plugin uses at
runtime for savings accounting.

Reproduce it yourself

Run the benchmark against any git repo you have locally:

# against the plugin itself (dogfood)
bun run scripts/run-benchmark.ts --repo .

# against any other repo
bun run scripts/run-benchmark.ts --repo /path/to/repo --out /tmp/results.json

# dry-run (no file written — useful for CI checks)
bun run scripts/run-benchmark.ts --dry-run

Requires: bun, git, ripgrep (rg). Same commit SHA always picks the same files.

Raw data

The full JSON result file — every sample, every ratio, the exact methodology string.

Download benchmarks-v2.json