SN11 trajectory-rl · Saturday, May 9, 2026

SN11 Grapples with Scoring Variance and Evaluation Issues

Community members reported recurring problems with miner re-evaluation, which produced inconsistent scores and unfair winner selection. The subnet team said a 12-hour submission rate limit launches this epoch and that broader feature rollouts are planned within two epochs. A detailed variance reduction analysis identified the testee model's unreliable tool calling as the root cause. It recommended replacing the model, controlling temperature, running multiple episodes per scenario, and removing unsolvable scenarios, with the goal of cutting the coefficient of variation from ~60% to ~20%.

• 12-hour submission rate limit launches this epoch
• Testee model (Qwen 3.5B) mode-flips between execution and description modes
• Recommended: replace the testee with a larger model and set temperature ≤0.1 for stability
• Run 3 episodes per scenario instead of 1 to reduce variance
• Document all evaluation parameters that are currently opaque to miners
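
To make the multi-episode recommendation concrete, here is a minimal Python sketch of how the coefficient of variation (CV) responds to averaging. It is not the subnet's evaluation code: the function names and the `run_episode` callable are hypothetical, and the calculation assumes episodes are independent and identically distributed. Under that assumption, averaging n episodes scales CV by 1/sqrt(n), so 3 episodes alone would take a 60% CV to roughly 35%. Reaching ~20% would therefore depend on the other recommendations as well.

```python
import math
import statistics


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean of the scores."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return statistics.stdev(scores) / mean


def averaged_cv(single_episode_cv, episodes):
    """CV of the mean of `episodes` runs, ASSUMING independent, identical runs."""
    if episodes < 1:
        raise ValueError("episodes must be at least 1")
    return single_episode_cv / math.sqrt(episodes)


def score_scenario(run_episode, episodes=3):
    """Average several episode scores for one scenario.

    `run_episode` is a hypothetical zero-argument callable returning a float;
    the real evaluator's interface is not described in the source.
    """
    return statistics.fmean(run_episode() for _ in range(episodes))


if __name__ == "__main__":
    for n in (1, 3, 9):
        print(f"{n} episode(s): CV = {averaged_cv(0.60, n):.1%}")
```

In this idealized model it would take about 9 episodes per scenario for averaging alone to bring a 60% CV down to 20%.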

Distilled from 29 team messages in the official Bittensor Discord. Generated by Claude Haiku 4.5.
