SN11 Grapples with Scoring Variance and Evaluation Issues
Community members reported recurring problems with miner re-evaluation, which produced inconsistent scores and unfair winner selection. The subnet team said a 12-hour submission rate limit launches this epoch and that broader feature rollouts are planned within two epochs. A detailed variance-reduction analysis identified the testee model's unreliable tool calling as the root cause. It recommended replacing the model, controlling temperature, testing multiple episodes per scenario, and removing unsolvable scenarios, with the aim of reducing the coefficient of variation from ~60% to ~20%.
- 12-hour submission rate limit launches this epoch
- Testee model (Qwen 3.5B) flips between execution and description modes
- Recommendation: replace the testee model with a larger one and set temperature ≤0.1 for stability
- Run 3 episodes per scenario instead of 1 to reduce variance
- Document all evaluation parameters that are currently opaque to miners
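As a rough illustration of the multi-episode recommendation, the sketch below computes a coefficient of variation (standard deviation divided by mean) and estimates how averaging several episodes changes it. It assumes episodes are independent and identically distributed, which the source does not state, and all function names are hypothetical rather than taken from the subnet's code. Under that assumption, averaging 3 episodes alone would take a ~60% CV to roughly 35%, so reaching ~20% presumably depends on the other recommendations as well.

```python
import math
import statistics


def coefficient_of_variation(scores):
    """Sample standard deviation of the scores divided by their mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return statistics.stdev(scores) / mean


def cv_after_averaging(single_episode_cv, episodes):
    """CV of the mean of `episodes` runs, assuming independent, identically
    distributed episodes (an assumption, not something the source confirms)."""
    if episodes < 1:
        raise ValueError("episodes must be at least 1")
    return single_episode_cv / math.sqrt(episodes)


def episodes_needed(single_episode_cv, target_cv):
    """Smallest episode count whose averaged CV is at or below the target,
    under the same independence assumption."""
    ratio = single_episode_cv / target_cv
    # Small tolerance so float rounding does not push an exact answer up by one.
    return max(1, math.ceil(ratio ** 2 - 1e-9))


if __name__ == "__main__":
    print(f"CV with 3 episodes: {cv_after_averaging(0.60, 3):.1%}")
    print(f"Episodes for 20% CV from averaging alone: {episodes_needed(0.60, 0.20)}")
```

Because the standard deviation of an average shrinks only with the square root of the episode count, extra episodes give diminishing returns, which is consistent with the analysis pairing them with a model change and a lower temperature.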
Distilled from 29 team messages in the official Bittensor Discord. Generated by Claude Haiku 4.5.