Resolution: Works as Designed
Priority: Major - P3
Affects Version/s: 5.0.0
Sharding 2021-07-12, Sharding 2021-07-26
We had an outage in a 4.0 cluster: a hardware/OS failure on one shard's primary server manifested as a stuck Hello request and resulted in an outage of the whole sharded cluster. The culprit was the RSM, which was blocked in a thread and accumulated unprocessed requests, eventually becoming unresponsive for all shards.
To check for a similar vulnerability in other branches, the new pass-through test was also ported to the current head (5.0). There the test exposes a somewhat different, yet still critical, bug.
If the attached diff is applied, the following RSM outage is reproduced:
- Failure injection is configured to delay the Hello response indefinitely on the primary
- The primary is forced to step down and a new primary is forced to step up
- Mongos is unable to detect the new primary because it enters an infinite loop:
  - The Hello request to the old (dysfunctional) primary fails with NetworkInterfaceExceededTimeLimit
  - RSM starts another Hello to the same server without even trying the other servers
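The loop described above can be illustrated with a toy model. This is only a sketch and not the actual RSM code: the function names, the server names, and the `HelloTimeout` exception (standing in for NetworkInterfaceExceededTimeLimit) are all hypothetical, and the retry count is bounded so the example terminates, whereas the reported loop is infinite.

```python
from typing import Callable, List, Optional


class HelloTimeout(Exception):
    """Hypothetical stand-in for NetworkInterfaceExceededTimeLimit."""


def find_primary_same_server(
    servers: List[str], hello: Callable[[str], bool], max_attempts: int
) -> Optional[str]:
    """Retry Hello against one server only, as in the loop described above."""
    target = servers[0]
    for _ in range(max_attempts):
        try:
            if hello(target):
                return target
        except HelloTimeout:
            continue  # same server is tried again; other servers are never asked
    return None


def find_primary_rotating(
    servers: List[str], hello: Callable[[str], bool], max_attempts: int
) -> Optional[str]:
    """Contrast case: move on to the next server after each failed Hello."""
    for attempt in range(max_attempts):
        target = servers[attempt % len(servers)]
        try:
            if hello(target):
                return target
        except HelloTimeout:
            continue  # next iteration picks a different server
    return None


def make_hello(stuck: str, primary: str) -> Callable[[str], bool]:
    """Build a fake Hello: the stuck server times out, others report their role."""
    def hello(server: str) -> bool:
        if server == stuck:
            raise HelloTimeout(server)
        return server == primary
    return hello


servers = ["old-primary", "new-primary", "secondary"]
hello = make_hello(stuck="old-primary", primary="new-primary")

stuck_result = find_primary_same_server(servers, hello, max_attempts=5)
rotating_result = find_primary_rotating(servers, hello, max_attempts=5)
print(stuck_result, rotating_result)
```

In this model the first strategy never learns about the new primary, no matter how many attempts it is given, while the second finds it on the next attempt after the timeout.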
Apparently, all our branches are vulnerable to this bug in one way or another. This ticket covers 5.0, and the fix should be backported to at least 4.4. The fix for 4.0 and 4.2 is tracked separately in SERVER-56854, as the code and the fix are quite different.