[SERVER-56917] Stuck Hello request may lead to cluster outage Created: 13/May/21 Updated: 27/Oct/23 Resolved: 16/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | sharding-product-sync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Steps To Reproduce: | Apply diff and run: buildscripts/resmoke.py run --suite=sharding_hello_failures jstests/concurrency/fsm_workloads/update_array.js
|
||||||||||||
| Sprint: | Sharding 2021-07-12, Sharding 2021-07-26 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Background
We had an outage in a 4.0 cluster: a hardware/OS failure on one shard's primary server manifested as a stuck Hello request and resulted in an outage of the whole sharded cluster. The culprit was the RSM (ReplicaSetMonitor), which was blocked in a thread and accumulated unprocessed requests, eventually becoming unresponsive for all shards. To check for a similar vulnerability in other branches, the new pass-through test was also ported to the current head (5.0). The test shows a somewhat different, yet critical, bug.
Test results
If the attached diff is applied, the following RSM outage is reproduced:
Apparently all our branches are vulnerable to this bug one way or another. This ticket covers 5.0; the fix should be backported to at least 4.4. The fix for 4.0 and 4.2 is a separate |
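The failure mode described in the Background can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the server's RSM code: `probe_all`, `fake_hello`, and the host names `shard0`..`shard2` are hypothetical, and the "stuck Hello" is modeled as a call that blocks on an event. It shows one way a monitor could bound every Hello by a shared deadline so that a single stuck host is reported as timed out while the other hosts are still answered.

```python
import concurrent.futures
import threading
import time

# Hypothetical stand-in for a Hello round trip; "shard1" simulates the stuck host.
release = threading.Event()

def fake_hello(host):
    if host == "shard1":
        release.wait()  # blocks until released, like a Hello that never returns
        return "ok-late"
    return "ok"

def probe_all(hosts, hello, timeout_s):
    """Send one hello per host in parallel, bounded by a single shared deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(hosts))
    futures = {host: pool.submit(hello, host) for host in hosts}
    deadline = time.monotonic() + timeout_s
    results = {}
    for host, fut in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[host] = fut.result(timeout=remaining)
        except concurrent.futures.TimeoutError:
            # Only the stuck host is marked; the caller is not blocked forever.
            results[host] = "timeout"
    pool.shutdown(wait=False)  # do not wait for the stuck worker
    return results

results = probe_all(["shard0", "shard1", "shard2"], fake_hello, timeout_s=0.2)
release.set()  # unblock the simulated stuck request so the worker thread can exit
print(results)
```

If the same three calls were made sequentially on one thread without a deadline, the call to `shard1` would never return and `shard2` would never be probed, which is the kind of backlog the Background describes.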
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 16/Jul/21 ] |
|
The integration test was ported to the other branches and showed that the bug exists only in the 4.0 branch. In the other branches, the suspected problem was actually caused by the failure injection preventing the test infrastructure from setting up properly, which was hard to distinguish from the actual bug. |
| Comment by Lamont Nelson [ 13/May/21 ] |
|
|