Benchmarks
-74.0%

Mean token savings across read, grep, and edit — measured against ashlrai/ashlr-plugin, commit b98da9c, on April 26, 2026.

files measured: 750
lines of code: 149,462
read samples: 16
grep patterns: 5

Per-tool breakdown

ashlr__read

mean -82.1% · p50 -87.8% · p90 -53.2%

ashlr__grep

mean -92.8% · p50 -98.6% · p90 -76.3%

ashlr__edit

mean +0.5% · p50 -52.0% · p90 +150.0% (positive values are token increases)

The ashlr__edit “small” scenario (a 15-character change) has a ratio above 1 by design: for tiny changes, the diff header is longer than the trivial before/after text. Medium and large edits compress well. The small-edit overhead is included in the figures above rather than filtered out.
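The overhead on tiny edits can be illustrated with a minimal sketch. It assumes the chars/4 token heuristic described in the methodology and a made-up summary format (one header line plus the first removed and first added line). The `summarize` function, its header text, and the sample inputs are hypothetical and are not the plugin's actual output format.

```typescript
// chars/4 token heuristic, as named in the methodology.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical diff summary: one header line plus the first removed
// and first added line. The real ashlr__edit format may differ.
function summarize(path: string, before: string, after: string): string {
  const firstLine = (s: string): string => s.split("\n", 1)[0] ?? "";
  const header = `edit ${path}: ${before.length} chars -> ${after.length} chars`;
  return [header, `- ${firstLine(before)}`, `+ ${firstLine(after)}`].join("\n");
}

// ratio = summary tokens / naive tokens (naive = ship before + after as text).
function editRatio(path: string, before: string, after: string): number {
  const naive = estimateTokens(before + "\n" + after);
  const summary = estimateTokens(summarize(path, before, after));
  return summary / naive;
}

// Tiny edit: the header alone outweighs the before/after text.
const smallRatio = editRatio("src/app.ts", "const n = 1;", "const n = 2;");

// Large edit: 60 changed lines, but the summary keeps only the first of each side.
const largeBefore = Array.from(
  { length: 60 },
  (_, i) => `const value${i} = compute(${i});`,
).join("\n");
const largeAfter = largeBefore.replace(/compute/g, "computeFast");
const largeRatio = editRatio("src/app.ts", largeBefore, largeAfter);

console.log({ smallRatio, largeRatio });
```

In this sketch the small edit lands above 1 and the large edit well below it, which is the same shape as the published small, medium, and large scenarios.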

Read sample scatter

ashlr__read: file size vs. reduction

[Scatter plot. x-axis: file size (ticks at 93.2 KB, 186.4 KB, 279.6 KB, 372.8 KB); y-axis: tokens saved (0% to 100%). Point labels: -53.2%, -50.3%, -60.2%, -61.4%, -87.8%, -82.4%, -79.4%, -71.3%, -91.5%, -91.3%, -96.1%, -92.1%, -99.0%, -99.3%, -98.1%, -99.6%]

Each dot is one sampled file. x-axis = raw file size; y-axis = tokens saved. Files below 2 KB are excluded (snipCompact only fires above that threshold).
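The sixteen point labels in the scatter can be reduced to the mean, p50, and p90 reported for ashlr__read. The sketch below assumes a nearest-rank percentile over ascending ratios (index floor(p × n)). That rule is an assumption rather than something taken from the benchmark script, although it is consistent with the published read figures. The helper names are illustrative.

```typescript
// Tokens-saved labels from the read scatter, in percent (16 sampled files).
const savedPct: number[] = [
  53.2, 50.3, 60.2, 61.4, 87.8, 82.4, 79.4, 71.3,
  91.5, 91.3, 96.1, 92.1, 99.0, 99.3, 98.1, 99.6,
];

// ratio = ashlrTokens / rawTokens, so ratio = 1 - saved fraction.
const ratios: number[] = savedPct.map((s) => 1 - s / 100);

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Assumed rule: nearest-rank percentile on ascending ratios.
// A higher ratio means less savings, so p90 is close to the worst case.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx]!;
}

// Format a ratio as a token change: negative means tokens saved.
function toChange(ratio: number): string {
  const pct = (ratio - 1) * 100;
  return `${pct > 0 ? "+" : ""}${pct.toFixed(1)}%`;
}

const readSummary = {
  mean: toChange(mean(ratios)),
  p50: toChange(percentile(ratios, 0.5)),
  p90: toChange(percentile(ratios, 0.9)),
};

console.log(readSummary); // { mean: "-82.1%", p50: "-87.8%", p90: "-53.2%" }
```

Because percentiles are taken over ratios rather than savings, p90 shows weaker savings than p50.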

Methodology

Measurement methodology (version 2):

**ashlr__read**: For each sampled source file, we measure the raw file bytes and token count (chars/4 heuristic). We then apply the same snipCompact transformation used at runtime (wrapping the content in a tool_result message and calling snipCompact()) and measure the resulting byte and token counts. The ratio is ashlrTokens / rawTokens. Files below 2 KB are excluded because snipCompact only fires on tool results > 2 000 chars, so savings are zero by design for small files. Files are selected deterministically: the repo HEAD commit SHA is folded into a 32-bit seed for a mulberry32 PRNG, then up to 4 files are sampled from each of four size buckets (2–5 KB, 5–15 KB, 15–50 KB, 50+ KB). Re-running on the same commit always picks the same files.

**ashlr__grep**: Five common patterns (function, import, TODO, class, interface) are run via rg --json against the repo root. Raw output bytes are measured directly. The ashlr__grep fallback path (no genome) truncates output to 4 000 chars (head 2 000 + tail 1 000). The ratio is truncated/raw. Note: when a .ashlrcode/genome/ index is present, real-world grep savings are substantially higher. This benchmark measures only the conservative no-genome baseline.

**ashlr__edit**: Three synthetic edits (small ~15 chars, medium ~300 chars, large ~3 000 chars) compare the naive "ship before+after as text" approach against ashlr__edit's diff-summary format (one header line + removed/added first-lines). The ratio is summary tokens / naive tokens.

**Aggregation**: Per-tool mean/p50/p90 are computed over each tool's ratio values (lower ratio = more savings). The `overall.mean` is pooled across every individual sample regardless of tool, so tools with more samples (read has 16, grep has 5, edit has 3) weight the overall figure proportionally. That makes the headline number reflect the workload mix, not a uniform per-tool average. The unweighted mean of per-tool means is intentionally not published because it gives equal weight to a 3-sample tool and a 16-sample tool, which over-weights the synthetic edit overhead. Token counts use the chars/4 heuristic, the same estimator the plugin uses at runtime for savings accounting.
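The gap between the pooled mean and the unweighted mean of per-tool means can be checked from the published per-tool figures. The sketch below rebuilds the pooled value as a sample-count-weighted average of per-tool mean ratios, which is algebraically the same as averaging every individual sample. The mean ratios are derived from the per-tool breakdown (for example, 82.1% savings is a ratio of 0.179, and the edit mean is read as a 0.5% increase, ratio 1.005). The read count of 16 comes from the summary stats. The `ToolStats` type and function names are illustrative.

```typescript
interface ToolStats {
  tool: string;
  samples: number;   // number of ratio values for this tool
  meanRatio: number; // mean of ashlrTokens / rawTokens for this tool
}

const tools: ToolStats[] = [
  { tool: "ashlr__read", samples: 16, meanRatio: 0.179 },
  { tool: "ashlr__grep", samples: 5, meanRatio: 0.072 },
  { tool: "ashlr__edit", samples: 3, meanRatio: 1.005 },
];

// Pooled mean: every sample counts once, so each tool is weighted by its sample count.
function pooledMean(stats: ToolStats[]): number {
  const totalSamples = stats.reduce((n, t) => n + t.samples, 0);
  const ratioSum = stats.reduce((s, t) => s + t.samples * t.meanRatio, 0);
  return ratioSum / totalSamples;
}

// Unweighted mean of per-tool means: every tool counts once, regardless of sample count.
function meanOfMeans(stats: ToolStats[]): number {
  return stats.reduce((s, t) => s + t.meanRatio, 0) / stats.length;
}

// Convert a ratio to a savings percentage string.
const savings = (ratio: number): string => `${((1 - ratio) * 100).toFixed(1)}%`;

const pooledSavings = savings(pooledMean(tools));
const unweightedSavings = savings(meanOfMeans(tools));

console.log(pooledSavings);     // 74.0%
console.log(unweightedSavings); // 58.1%
```

The pooled figure matches the 74.0% headline, while the unweighted average would drop to about 58% because the three synthetic edit samples would carry a third of the weight.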

Reproduce it yourself

Run the benchmark against any git repo you have locally:

# against the plugin itself (dogfood)
bun run scripts/run-benchmark.ts --repo .

# against any other repo
bun run scripts/run-benchmark.ts --repo /path/to/repo --out /tmp/results.json

# dry-run (no file written — useful for CI checks)
bun run scripts/run-benchmark.ts --dry-run

Requires: bun, git, ripgrep (rg). Same commit SHA always picks the same files.
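The determinism claim can be illustrated with a short sketch of seeded, bucketed sampling. mulberry32 is the PRNG named in the methodology. Everything else here is an assumption made for illustration: how the SHA is folded into a seed (`foldSha`), the use of 1 KB = 1024 bytes, the sort-then-shuffle selection, and the names `FileInfo`, `sampleFiles`, and `demoFiles`. The actual script may do these steps differently.

```typescript
interface FileInfo {
  path: string;
  bytes: number;
}

// mulberry32: a small 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed folding scheme: XOR successive 8-hex-digit chunks of the SHA.
function foldSha(sha: string): number {
  let seed = 0;
  for (let i = 0; i < sha.length; i += 8) {
    seed ^= parseInt(sha.slice(i, i + 8), 16);
  }
  return seed >>> 0;
}

const KB = 1024; // assumption: binary kilobytes
const BUCKETS: Array<[number, number]> = [
  [2 * KB, 5 * KB],
  [5 * KB, 15 * KB],
  [15 * KB, 50 * KB],
  [50 * KB, Infinity],
];

// Pick up to `perBucket` files from each size bucket, driven only by the SHA.
function sampleFiles(files: FileInfo[], sha: string, perBucket = 4): FileInfo[] {
  const rng = mulberry32(foldSha(sha));
  const picked: FileInfo[] = [];
  for (const [lo, hi] of BUCKETS) {
    // Sort by path first so the result does not depend on input order.
    const pool = files
      .filter((f) => f.bytes >= lo && f.bytes < hi)
      .sort((x, y) => (x.path < y.path ? -1 : x.path > y.path ? 1 : 0));
    // Fisher-Yates shuffle using the seeded PRNG.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      const tmp = pool[i]!;
      pool[i] = pool[j]!;
      pool[j] = tmp;
    }
    picked.push(...pool.slice(0, perBucket));
  }
  return picked;
}

// Hypothetical file listing: 40 files from 2 000 to 80 000 bytes.
const demoFiles: FileInfo[] = Array.from({ length: 40 }, (_, i) => ({
  path: `src/f${i}.ts`,
  bytes: (i + 1) * 2000,
}));

const firstRun = sampleFiles(demoFiles, "b98da9c");
const secondRun = sampleFiles([...demoFiles].reverse(), "b98da9c");
console.log(firstRun.map((f) => f.path));
```

Sorting each bucket before shuffling is what keeps the selection independent of directory traversal order, so two runs on the same commit agree even if files are listed differently.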

Raw data

The full JSON result file — every sample, every ratio, the exact methodology string.

Download benchmarks-v2.json