Affine fixes NAVWORLD scoring instability with parallel judges
NAVWORLD scoring was highly unstable: identical outputs received different scores (30% to 70% variance) because each was graded by a single LLM judge. The team deployed a fix that invokes all judges in parallel and takes the median score per dimension across 3 LLMs (or 2 when one is unavailable), so the result no longer depends on any single judge's bias. The Terminal module has completed shadow-run testing and bug fixes; scoring integration is expected Tuesday or Wednesday.
- LLM judge inconsistency caused >40% score swings on the same task
- The new parallel-judge median approach is intended to make results reproducible
- The Terminal module is ready for scoring rollout after bug fixes
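The parallel-judge median approach can be sketched as follows. This is a minimal illustration, not Affine's actual implementation: the function names (`score_with_judges`, `make_stub_judge`), the dimension names (`accuracy`, `completeness`), the 0 to 100 score scale, and the stub judges are all assumptions made for the example. Each judge is modeled as an async callable that returns a score per dimension; a judge that raises is treated as unavailable.

```python
import asyncio
from statistics import median


async def score_with_judges(judges, output, min_judges=2):
    """Run all judges concurrently and return the per-dimension median score."""
    # return_exceptions=True keeps one failing judge from aborting the others.
    results = await asyncio.gather(
        *(judge(output) for judge in judges), return_exceptions=True
    )
    # Keep only successful responses; failed judges count as unavailable.
    valid = [r for r in results if isinstance(r, dict)]
    if len(valid) < min_judges:
        raise RuntimeError(f"only {len(valid)} judge(s) responded")
    # Score only the dimensions that every responding judge reported.
    dimensions = set.intersection(*(set(r) for r in valid))
    return {dim: median(r[dim] for r in valid) for dim in sorted(dimensions)}


def make_stub_judge(scores, fail=False):
    """Hypothetical stand-in for a real LLM judge call."""
    async def judge(output):
        if fail:
            raise ConnectionError("judge unavailable")
        return scores
    return judge


if __name__ == "__main__":
    judges = [
        make_stub_judge({"accuracy": 30, "completeness": 60}),
        make_stub_judge({"accuracy": 70, "completeness": 50}),
        make_stub_judge({"accuracy": 60, "completeness": 80}),
    ]
    print(asyncio.run(score_with_judges(judges, "same output")))
```

With three judges, a single outlier (the 30 on `accuracy` here) cannot move the median. With only two judges, `statistics.median` returns the mean of the two scores, so the fallback is less robust to one biased judge than the three-judge case.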
Distilled from 20 team messages in the official Bittensor Discord. Generated by Claude Haiku 4.5.