Distil SN97 Infrastructure Unstable After 18-Hour Outage
SN97 suffered a critical 18-hour outage beginning around 14:00 UTC after infrastructure issues cascaded: restart loops, permission failures that prevented state persistence, and an h2h_latest.json file that went more than 18 hours without an update. Partial recovery occurred around 20:30 UTC, but fundamental bugs remain unfixed: king selection is broken (the local state file is desynchronized from on-chain state), model loads hang for 50+ minutes, and rounds still take 5-6 hours instead of the 2-3 hour target. UID 26 recently dethroned UID 48, but on-chain and local state are out of sync, which sets up a repeat failure in the next round.
- 18-hour outage: a permissions bug on h2h_latest.json prevented rounds from starting or persisting state.
- King stuck: h2h_latest.json is frozen at block 8042152; update_h2h_state() is not writing new data between rounds.
- Model load hang: eval was blocked for 52+ minutes on rl00re/sn97-w325, likely because the model is corrupted or oversized.
- Slow evals: rounds take 5-6 hours against a 2-3 hour target. Benchmarks are still running, so the bench battery was not actually disabled.
- State sync bug: UID 26 is the on-chain king, but local files still show UID 48; the next round will repeat the 'no king' failure.
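The permission, staleness, and desync failures listed above could in principle be caught by a pre-round health check. The following is a minimal sketch under stated assumptions, not the validator's actual code: the function name `check_state_file`, the `king_uid` field, and the 3-hour staleness threshold are all hypothetical, and the on-chain king UID is assumed to be fetched elsewhere and passed in.

```python
import json
import os
import time
from pathlib import Path


def check_state_file(path, onchain_king_uid, max_age_s=3 * 3600, now=None):
    """Return a list of problems found with a local state file (hypothetical helper)."""
    path = Path(path)
    if not path.exists():
        return [f"{path} is missing"]

    problems = []

    # A file the process cannot write to cannot persist round results.
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by this process")

    # Compare the last-modified time against the allowed age.
    now = time.time() if now is None else now
    age_s = now - path.stat().st_mtime
    if age_s > max_age_s:
        problems.append(f"{path} is stale ({age_s / 3600:.1f} h old)")

    try:
        state = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        problems.append(f"{path} is unreadable: {exc}")
        return problems

    # "king_uid" is an assumed field name; the real schema is not given in the source.
    local_king = state.get("king_uid") if isinstance(state, dict) else None
    if local_king != onchain_king_uid:
        problems.append(
            f"king mismatch: local={local_king}, on-chain={onchain_king_uid}"
        )
    return problems
```

Running a check like this before each round, and refusing to start when it returns any problems, would surface a UID 26 versus UID 48 mismatch or a frozen file as an explicit error rather than a silent 'no king' failure.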
Distilled from 137 team messages in the official Bittensor Discord. Generated by Claude Haiku 4.5.