CAM-PULSE A/B Knowledge Proof — Paired Within-Subject Experiment

Variant Success (with KB): 92.3%
Control Success (no KB): 73.1%
Wilcoxon Signed-Rank: p = 0.015
Cohen's dz (medium): 0.45
Discordant Wins (var : ctrl): 6 : 1

Experiment Design

Why this experiment is trustworthy: the paired within-subject design controls for agent ability, task difficulty, and arm imbalance by construction.

Does CAM-PULSE's knowledge base (3,044 mined coding methodologies from 329 real repos) measurably improve AI agent code quality?

26 coding tasks on the graphify repository (326 tests, 0.6s feedback loop)
Each task run twice on the same agent: once with KB (variant), once without (control)
5 agents (Codex, Claude, Gemini, Grok, local Ollama) assigned round-robin
Blind: neither agent nor verifier knows which arm it's in
Arm order randomized per pair to prevent order effects
Workspace reset to clean state between every run
7-check verifier + pytest + 6-dimensional SWE quality metric
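
The pairing scheme above can be sketched as follows. This is a minimal illustration, not the project's actual harness: the names `Pair` and `build_schedule` and the fixed seed are assumptions, and only the round-robin agent assignment and per-pair arm-order randomization come from the design list.

```python
import random
from dataclasses import dataclass

# The five agents named in the design list (labels are illustrative strings).
AGENTS = ["Codex", "Claude", "Gemini", "Grok", "Ollama"]


@dataclass(frozen=True)
class Pair:
    """One task, one agent, two runs (hypothetical record type)."""
    task_id: int
    agent: str
    arm_order: tuple  # e.g. ("variant", "control")


def build_schedule(n_tasks: int, seed: int = 0) -> list:
    """Assign agents round-robin and shuffle the arm order within each pair."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for task_id in range(n_tasks):
        agent = AGENTS[task_id % len(AGENTS)]  # round-robin assignment
        arms = ["control", "variant"]
        rng.shuffle(arms)  # randomize which arm runs first
        schedule.append(Pair(task_id, agent, tuple(arms)))
    return schedule


schedule = build_schedule(26)
```

With 26 tasks and 5 agents, round-robin assignment gives one agent 6 pairs and the other four 5 pairs each, so per-agent samples are small.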

Agent confounding: Same agent for both arms, so if Codex is stronger than local Ollama, that advantage applies to both arms equally
Task difficulty: Same task for both arms — a hard task is hard for both
Sample imbalance: Exactly 1 control + 1 variant per pair
Selection bias: Tasks curated from prior trial (≥50% historical success rate)

Each block is one pair. Green = variant won. Red = control won. Gray = tie (<0.01 diff).
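
The color rule in the legend amounts to a three-way classification of each pair. A small sketch, assuming each arm yields a single numeric score; the function name `classify_pair` and the example scores are hypothetical, and only the 0.01 tie threshold and the color mapping come from the legend.

```python
TIE_THRESHOLD = 0.01  # from the legend: differences below 0.01 count as ties

COLORS = {"variant": "green", "control": "red", "tie": "gray"}


def classify_pair(control_score: float, variant_score: float,
                  tie_threshold: float = TIE_THRESHOLD) -> str:
    """Return which arm won the pair, or "tie" if the scores are nearly equal."""
    diff = variant_score - control_score
    if abs(diff) < tie_threshold:
        return "tie"
    return "variant" if diff > 0 else "control"


# Made-up scores purely for illustration, not taken from the experiment.
example = classify_pair(control_score=0.50, variant_score=0.75)
example_color = COLORS[example]
```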

Variant: 92.3% (24/26)
Control: 73.1% (19/26)

All 5 agents show a positive mean difference, so the KB effect is not driven by any single agent.
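
A per-agent breakdown like this can be computed by grouping the paired differences by agent. The sketch below assumes rows of the form (agent, control score, variant score); the helper name `mean_diff_by_agent` and the sample rows are placeholders, not values from the experiment.

```python
from collections import defaultdict
from statistics import fmean


def mean_diff_by_agent(rows):
    """Average (variant - control) per agent over that agent's pairs."""
    diffs = defaultdict(list)
    for agent, control, variant in rows:
        diffs[agent].append(variant - control)
    return {agent: fmean(values) for agent, values in diffs.items()}


# Placeholder rows for illustration only; real rows would come from the pair table.
rows = [
    ("Codex", 0.70, 0.90),
    ("Codex", 0.80, 0.80),
    ("Claude", 0.60, 0.75),
]
per_agent = mean_diff_by_agent(rows)
all_positive = all(value > 0 for value in per_agent.values())
```

Because each agent contributes only 5 or 6 pairs, a positive per-agent mean is a consistency check rather than a per-agent significance claim.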

Every pair, every score.

#	Task	Agent	Control	Variant	Diff	C	V	Result

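All one-sided (H₁: variant > control). The tests below are computed on the same 26 pairs, so they are complementary checks rather than independent evidence; they point in the same direction.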

Test	Statistic	p-value	Interpretation
Paired t-test	t = 2.248	p = 0.017	Significant at α=0.05
Wilcoxon signed-rank	W = 122	p = 0.015	Non-parametric confirmation
McNemar (binary)	6 vs 1 discordant	p = 0.063	Not significant at α=0.05; 6:1 ratio favors variant
Bootstrap 95% CI	[+0.023, +0.270]	excludes 0	Effect is reliably positive
Cohen's dz	0.45	—	Medium effect size
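
Two rows of the table can be checked directly from the reported numbers, using only the standard library. This is a sketch under stated assumptions: it assumes the McNemar row is an exact one-sided binomial test on the discordant pairs, and that Cohen's dz was derived from the paired t statistic. The function names are hypothetical, and the `sample_diffs` list is made-up data used only to show how a percentile bootstrap interval could be computed.

```python
import math
import random
from statistics import fmean


def mcnemar_exact_one_sided(variant_only: int, control_only: int) -> float:
    """P(X >= variant_only) for X ~ Binomial(n discordant pairs, 0.5)."""
    n = variant_only + control_only
    tail = sum(math.comb(n, k) for k in range(variant_only, n + 1))
    return tail / 2 ** n


# Discordant pairs from the table: 6 variant-only wins, 1 control-only win.
p_mcnemar = mcnemar_exact_one_sided(6, 1)  # 8/128 = 0.0625, reported as 0.063

# The implied 2x2 table follows from 24/26 and 19/26 successes.
both_pass = 24 - 6                   # 18 pairs where both arms succeeded
both_fail = 26 - both_pass - 6 - 1   # 1 pair where both arms failed

# Cohen's dz from the paired t statistic (t = 2.248, n = 26 pairs).
t_stat, n_pairs = 2.248, 26
dz_sample = t_stat / math.sqrt(n_pairs)          # about 0.44 (sample SD)
dz_population = t_stat / math.sqrt(n_pairs - 1)  # about 0.45 (population SD)


def bootstrap_mean_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up differences for illustration; NOT the experiment's per-pair data.
sample_diffs = [0.20, 0.00, 0.10, -0.05, 0.30, 0.15, 0.00, 0.25]
ci_low, ci_high = bootstrap_mean_ci(sample_diffs)
```

The exact binomial tail reproduces the reported p = 0.063. The reported dz of 0.45 matches t/sqrt(n-1), which would correspond to a population standard deviation; the more common t/sqrt(n) gives about 0.44. Which convention was actually used is not stated in the table.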