[SERVER-56917] Stuck Hello request may lead to cluster outage Created: 13/May/21  Updated: 27/Oct/23  Resolved: 16/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive)
Assignee: Andrew Shuvalov (Inactive)
Resolution: Works as Designed
Votes: 0
Labels: sharding-product-sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File hello_delay.diff    
Issue Links:
Backports
Related
is related to SERVER-56854 Provide the ability for RSM requests ... Closed
Operating System: ALL
Steps To Reproduce:

Apply diff and run:

buildscripts/resmoke.py run --suite=sharding_hello_failures jstests/concurrency/fsm_workloads/update_array.js

Sprint: Sharding 2021-07-12, Sharding 2021-07-26
Participants:

 Description   

Background

We had an outage in a 4.0 cluster: a hardware/OS failure on one shard's primary server manifested as a stuck Hello request and resulted in an outage of the whole sharded cluster. The culprit was the RSM (ReplicaSetMonitor), which was blocked in a thread and accumulated unprocessed requests, eventually becoming unresponsive for all shards.
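
As a rough illustration of how one stuck request can starve monitoring for every shard, here is a toy Python model. It is not the RSM code: the single-worker pool, the shard names, and the `check_shards` helper are all assumptions made for the sketch. One worker thread serves all Hello-like checks, so a check that never returns leaves the checks for the other shards queued behind it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_shards(shards, stuck_shard, wait_seconds=0.2):
    """Run one Hello-like check per shard on a single worker thread.

    Returns the shards whose check did not finish within wait_seconds.
    Hypothetical helper: it models the symptom, not the real RSM threading.
    """
    release = threading.Event()

    def hello(shard):
        if shard == stuck_shard:
            release.wait()  # simulates a Hello that never gets a response
        return shard

    unresponsive = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # All checks share the one worker, in submission order.
        futures = {shard: pool.submit(hello, shard) for shard in shards}
        for shard, future in futures.items():
            try:
                future.result(timeout=wait_seconds)
            except FutureTimeout:
                unresponsive.append(shard)
        release.set()  # unblock the worker so the pool can shut down
    return unresponsive


blocked = check_shards(["shard0", "shard1", "shard2"], stuck_shard="shard0")
healthy = check_shards(["shard0", "shard1", "shard2"], stuck_shard=None)
```

In this model `blocked` lists all three shards even though only `shard0` is stuck, while `healthy` is empty.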

To check for a similar vulnerability in other branches, the new passthrough test was also ported to the current head (5.0). The test shows a somewhat different, yet still critical, bug.

Test results

If the attached diff is applied, the following RSM outage is reproduced:

  • Fail injection is configured to delay the Hello response indefinitely at the primary
  • The primary is forced to step down and a new primary is forced to step up
  • Mongos is unable to detect the new primary because it enters an infinite loop:
      • The Hello request to the old (dysfunctional) primary fails with NetworkInterfaceExceededTimeLimit
      • RSM starts another Hello to the same server without even trying other servers
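
The loop reported in the steps above can be sketched with a small Python simulation. This is a toy model under stated assumptions, not the mongos implementation: the host names, the `hello` and `find_primary` helpers, and the `rotate` switch are hypothetical and exist only to contrast retrying the same server with moving on to the next known member.

```python
# Toy model only: host names and both selection policies are hypothetical,
# not taken from the MongoDB source.
TIMEOUT = "NetworkInterfaceExceededTimeLimit"


def hello(host, stuck_hosts, primary):
    """Simulate one Hello round trip; a stuck host always times out."""
    if host in stuck_hosts:
        return TIMEOUT
    return "primary" if host == primary else "secondary"


def find_primary(hosts, stuck_hosts, primary, rotate, max_attempts=10):
    """Return (primary or None, list of hosts probed in order)."""
    probed = []
    index = 0
    for _ in range(max_attempts):
        host = hosts[index]
        probed.append(host)
        if hello(host, stuck_hosts, primary) == "primary":
            return host, probed
        if rotate:
            # Move to the next known member after a timeout or non-primary reply.
            index = (index + 1) % len(hosts)
        # Without rotation the same host is probed again on the next pass.
    return None, probed


hosts = ["old-primary:27017", "new-primary:27017", "secondary:27017"]
stuck = {"old-primary:27017"}

looped = find_primary(hosts, stuck, "new-primary:27017", rotate=False)
rotated = find_primary(hosts, stuck, "new-primary:27017", rotate=True)
```

With `rotate=False` every attempt goes to the stuck old primary and no primary is found within the attempt budget; with `rotate=True` the second probe reaches the new primary.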

Apparently all our branches are vulnerable to this bug in one way or another. This ticket covers 5.0, and the fix should be backported to at least 4.4. The fix for 4.0 and 4.2 is tracked separately in SERVER-56854, as the code and the fix are quite different.



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 16/Jul/21 ]

The integration test was ported to other branches and proved that the bug existed only in the 4.0 branch. In the other branches, the root cause of the suspected problem was that the fail injection prevented the test infrastructure from setting up properly, which was hard to separate from the actual bug.

Comment by Lamont Nelson [ 13/May/21 ]

SERVER-56854 is the 4.0 and 4.2 version of this ticket, which is being worked on right now.
