Benchmarks
-74.0%
Mean token savings across read, grep, and edit — measured against ashlrai/ashlr-plugin, commit b98da9c, on April 26, 2026.
files measured: 750
lines of code: 149,462
read samples: 16
grep patterns: 5
Per-tool breakdown
[Bar charts, one each for ashlr__read, ashlr__grep, and ashlr__edit: dark bar = mean · mid = p50 · light = p90]
The ashlr__edit “small” scenario (15-char change) shows a ratio > 1 by design: for tiny changes, the diff header is longer than the trivial before/after text. Medium and large edits compress well. The small-edit result is kept in the published figures rather than dropped.
Read sample scatter
ashlr__read: file size vs. reduction
Each dot is one sampled file. x-axis = raw file size; y-axis = tokens saved. Files below 2 KB are excluded (snipCompact only fires above that threshold).
Methodology
Measurement methodology (version 2):
**ashlr__read**: For each sampled source file, we measure raw file bytes and
token count (chars/4 heuristic). We then apply the same snipCompact
transformation used at runtime — wrapping the content in a tool_result message
and calling snipCompact() — and measure the resulting byte/token count. The
ratio is ashlrTokens / rawTokens. Files below 2 KB are excluded because
snipCompact only fires on tool results > 2 000 chars; savings are zero by
design for small files.
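
The read measurement above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: `Compactor` is a hypothetical stand-in for the real snipCompact() call (whose signature is not shown here), rounding up in `estimateTokens` is an assumption, and the tool_result wrapping step is omitted.

```typescript
// Hypothetical stand-in for the plugin's snipCompact(); the real signature may differ.
type Compactor = (text: string) => string;

// Threshold from the methodology: compaction only fires above 2 000 chars.
const MIN_CHARS = 2000;

// chars/4 heuristic; rounding up is an assumption of this sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface ReadSample {
  rawTokens: number;
  ashlrTokens: number;
  ratio: number; // ashlrTokens / rawTokens, lower = more savings
}

// Returns null for inputs at or below the threshold, which the benchmark excludes.
function measureRead(raw: string, compact: Compactor): ReadSample | null {
  if (raw.length <= MIN_CHARS) return null;
  const rawTokens = estimateTokens(raw);
  const ashlrTokens = estimateTokens(compact(raw));
  return { rawTokens, ashlrTokens, ratio: ashlrTokens / rawTokens };
}
```

Passing the compactor in as a parameter keeps the sketch independent of the plugin's internals: any function from string to string can be measured the same way.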
Files are selected deterministically: the repo HEAD commit SHA is folded into a
32-bit seed (mulberry32 PRNG), then up to 4 files are sampled from each of four
size buckets (2–5 KB, 5–15 KB, 15–50 KB, 50+ KB). Re-running on the same
commit always picks the same files.
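
A sketch of how such deterministic sampling could work is shown below. The mulberry32 routine is the commonly circulated version of that PRNG; everything else is an assumption made for illustration: the XOR fold of the SHA in `seedFromSha`, 1 KB = 1024 bytes, the path sort, and the partial shuffle used to draw files are not taken from the benchmark script.

```typescript
// mulberry32: a small 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed folding scheme: XOR successive 8-hex-digit chunks of the SHA.
function seedFromSha(sha: string): number {
  let seed = 0;
  for (let i = 0; i < sha.length; i += 8) {
    seed ^= parseInt(sha.slice(i, i + 8), 16);
  }
  return seed >>> 0;
}

interface FileInfo {
  path: string;
  bytes: number;
}

const KB = 1024; // assumption: the script may use 1000 instead
const BUCKETS: Array<[number, number]> = [
  [2 * KB, 5 * KB],
  [5 * KB, 15 * KB],
  [15 * KB, 50 * KB],
  [50 * KB, Infinity],
];

// Draws up to perBucket files from each size bucket using a seeded partial shuffle.
function sampleFiles(files: FileInfo[], sha: string, perBucket = 4): FileInfo[] {
  const rand = mulberry32(seedFromSha(sha));
  const picked: FileInfo[] = [];
  for (const [lo, hi] of BUCKETS) {
    // Sort by path so the result does not depend on directory listing order.
    const pool = files
      .filter((f) => f.bytes >= lo && f.bytes < hi)
      .sort((x, y) => (x.path < y.path ? -1 : x.path > y.path ? 1 : 0));
    const n = Math.min(perBucket, pool.length);
    for (let i = 0; i < n; i++) {
      const j = i + Math.floor(rand() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      picked.push(pool[i]);
    }
  }
  return picked;
}
```

Because the only source of randomness is the seed derived from the commit SHA, two runs over the same file list produce the same selection.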
**ashlr__grep**: Five common patterns (function, import, TODO, class, interface)
are run via rg --json against the repo root. Raw output bytes are measured
directly. The ashlr__grep fallback path (no genome) truncates output to 4 000
chars (head 2 000 + tail 1 000). The ratio is truncated/raw.
Note: when a .ashlrcode/genome/ index is present, real-world grep savings are
substantially higher. This benchmark measures only the conservative
no-genome baseline.
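
The no-genome truncation and its ratio can be illustrated with a short sketch. The cap, head, and tail sizes are left as parameters because the text gives a 4 000-char cap alongside a 2 000 + 1 000 head/tail split; the elision marker and the use of character counts rather than bytes are assumptions of this sketch.

```typescript
// Keeps the first `head` and last `tail` chars of outputs longer than `cap`.
// The marker text is hypothetical; the plugin's actual marker is not shown in the source.
function truncateHeadTail(raw: string, cap: number, head: number, tail: number): string {
  if (raw.length <= cap) return raw;
  const elided = raw.length - head - tail;
  return raw.slice(0, head) + `\n[... ${elided} chars elided ...]\n` + raw.slice(raw.length - tail);
}

// Ratio of truncated to raw output size (lower = more savings).
function grepRatio(raw: string, cap = 4000, head = 2000, tail = 1000): number {
  if (raw.length === 0) return 1; // nothing to save on empty output
  return truncateHeadTail(raw, cap, head, tail).length / raw.length;
}
```

For a pattern that matches rarely, the output stays under the cap and the ratio is 1; the savings come entirely from patterns whose raw output is large.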
**ashlr__edit**: Three synthetic edits (small ~15 chars, medium ~300 chars,
large ~3 000 chars) compare the naive "ship before+after as text" approach
against ashlr__edit's diff-summary format (one header line + removed/added
first-lines). The ratio is summary tokens / naive tokens.
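
To see why a tiny edit can score a ratio above 1, consider the sketch below. The summary layout in `summarizeEdit` is invented for illustration (one header line plus the first removed and first added line) and is not ashlr__edit's actual output format; rounding up in the chars/4 estimate is also an assumption.

```typescript
// chars/4 heuristic; rounding up is an assumption of this sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function firstLine(text: string): string {
  return text.split("\n", 1)[0];
}

// Hypothetical diff-summary: header line, then first removed and first added line.
function summarizeEdit(path: string, before: string, after: string): string {
  const header = `edit ${path}: -${before.length} chars, +${after.length} chars`;
  return [header, `- ${firstLine(before)}`, `+ ${firstLine(after)}`].join("\n");
}

// Ratio = summary tokens / naive tokens, where naive ships before and after as text.
function editRatio(path: string, before: string, after: string): number {
  const naive = before + "\n" + after;
  return estimateTokens(summarizeEdit(path, before, after)) / estimateTokens(naive);
}

// A 15-char change: the fixed header outweighs the tiny payload.
const smallRatio = editRatio("src/config.ts", "const limit = 5", "const limit = 9");

// A roughly 3 000-char change: the header cost is amortised.
const bigBefore = Array.from({ length: 100 }, (_, i) => `const value${i} = compute(${i});`).join("\n");
const bigAfter = bigBefore.replace(/compute/g, "derive");
const largeRatio = editRatio("src/big.ts", bigBefore, bigAfter);
```

The header is a fixed cost, so the ratio falls as the edited text grows; in this sketch the small edit lands above 1 and the large edit well below it.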
**Aggregation**: per-tool mean/p50/p90 are computed over each tool's ratio
values (lower ratio = more savings). The `overall.mean` is pooled across
every individual sample regardless of tool — so tools with more samples (read
has 15, grep has 5, edit has 3) weight the overall figure proportionally.
That makes the headline number reflect the workload mix, not a uniform
per-tool average. The unweighted mean of per-tool means is intentionally NOT
published because it gives equal weight to a 3-sample tool and a 15-sample
tool, which over-weights the synthetic edit overhead.
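
The difference between the pooled mean and the unweighted mean of per-tool means is easy to see in code. The sketch below uses made-up ratios purely for illustration (they are not benchmark results), and the nearest-rank percentile is an assumed method, since the source does not say how p50 and p90 are computed.

```typescript
type Tool = "read" | "grep" | "edit";

interface Sample {
  tool: Tool;
  ratio: number;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Nearest-rank percentile (assumed method).
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Pooled: every sample counts once, so tools with more samples weigh more.
function pooledMean(samples: Sample[]): number {
  return mean(samples.map((s) => s.ratio));
}

// Unweighted: each tool counts once, regardless of how many samples it has.
function meanOfToolMeans(samples: Sample[]): number {
  const byTool = new Map<Tool, number[]>();
  for (const s of samples) {
    const ratios = byTool.get(s.tool) ?? [];
    ratios.push(s.ratio);
    byTool.set(s.tool, ratios);
  }
  return mean(Array.from(byTool.values()).map((ratios) => mean(ratios)));
}

// Illustrative values only: four well-compressed reads and one costly edit.
const demo: Sample[] = [
  { tool: "read", ratio: 0.25 },
  { tool: "read", ratio: 0.25 },
  { tool: "read", ratio: 0.25 },
  { tool: "read", ratio: 0.25 },
  { tool: "edit", ratio: 1.5 },
];
```

With these illustrative inputs, `pooledMean(demo)` is 0.5 while `meanOfToolMeans(demo)` is 0.875: the single edit sample pulls the unweighted figure much further than the pooled one.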
Token counts use the chars/4 heuristic, the same estimator the plugin uses at
runtime for savings accounting.
Reproduce it yourself
Run the benchmark against any git repo you have locally:
    # against the plugin itself (dogfood)
    bun run scripts/run-benchmark.ts --repo .

    # against any other repo
    bun run scripts/run-benchmark.ts --repo /path/to/repo --out /tmp/results.json

    # dry-run (no file written; useful for CI checks)
    bun run scripts/run-benchmark.ts --dry-run
Requires: bun, git, ripgrep (rg). Same commit SHA always picks the same files.
Raw data
The full JSON result file — every sample, every ratio, the exact methodology string.
Download benchmarks-v2.json