Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 5.0.0
Component/s: None
Labels:
- sharding-product-sync

Operating System:
ALL
Steps To Reproduce:

Apply diff and run:

buildscripts/resmoke.py run --suite=sharding_hello_failures jstests/concurrency/fsm_workloads/update_array.js

Sprint:
Sharding 2021-07-12, Sharding 2021-07-26
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Background

We had an outage in a 4.0 cluster: a hardware/OS failure on one shard's primary server manifested as a stuck Hello request and resulted in an outage of the whole sharded cluster. The culprit was the RSM (replica set monitor), which was blocked in a thread and accumulated unprocessed requests, eventually becoming unresponsive for all shards.

To check whether other branches have a similar vulnerability, the new pass-through test was also ported to the current head (5.0). The test shows a somewhat different, yet still critical, bug.

Test results

If the attached diff is applied, the following RSM outage is reproduced:

- Failure injection is configured to delay the Hello response indefinitely at the primary
- The primary is forced to step down and a new primary is forced to step up
- Mongos is unable to detect the new primary because it enters an infinite loop:
  - The Hello request to the old (dysfunctional) primary fails with NetworkInterfaceExceededTimeLimit
  - The RSM starts another Hello to the same server without even trying the other servers
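
The loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions and not the actual RSM code: `find_primary`, `probe`, `skip_failed`, and the server names are all hypothetical, and the "mark as failed and move on" branch is just one possible alternative behavior, shown for contrast.

```python
from typing import Callable, List, Optional

# Hypothetical outcomes of a single Hello probe in this toy model.
TIMEOUT = "NetworkInterfaceExceededTimeLimit"
PRIMARY = "primary"
SECONDARY = "secondary"


def find_primary(
    servers: List[str],
    probe: Callable[[str], str],
    max_probes: int,
    skip_failed: bool,
) -> Optional[str]:
    """Probe servers until one reports itself primary or the budget runs out.

    skip_failed=False models the reported behavior: after a timeout the
    same server is probed again. skip_failed=True models the alternative
    of marking the server as failed and moving on to the next one.
    """
    failed = set()
    index = 0
    for _ in range(max_probes):
        candidates = [s for s in servers if s not in failed]
        if not candidates:
            return None
        target = candidates[index % len(candidates)]
        result = probe(target)
        if result == PRIMARY:
            return target
        if result == TIMEOUT:
            if skip_failed:
                failed.add(target)  # never probe this server again
            # Without skip_failed, index is unchanged, so the next
            # iteration probes the same server again.
            continue
        index += 1  # a healthy non-primary answered; try the next server
    return None


# Hypothetical three-node replica set after the forced step-down.
STATE = {"old_primary": TIMEOUT, "new_primary": PRIMARY, "secondary": SECONDARY}
SERVERS = list(STATE)


def run(skip_failed: bool):
    """Return (detected primary, list of servers probed) for one strategy."""
    probed: List[str] = []

    def probe(server: str) -> str:
        probed.append(server)
        return STATE[server]

    found = find_primary(SERVERS, probe, max_probes=10, skip_failed=skip_failed)
    return found, probed
```

In this model, `run(skip_failed=False)` spends its whole probe budget on `old_primary` and never finds the new primary, while `run(skip_failed=True)` reaches `new_primary` on the second probe.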

Apparently all our branches are vulnerable to this bug in one way or another. This ticket covers 5.0, and the fix should be backported to at least 4.4. The fix for 4.0 and 4.2 is tracked separately in ~~SERVER-56854~~, as the code and the fix are quite different.

hello_delay.diff
11 kB
May 13 2021 03:06:11 PM UTC

is related to

SERVER-56854 Provide the ability for RSM requests to timeout and mark the server as failed

Closed

Assignee:: Andrew Shuvalov (Inactive)
Reporter:: Andrew Shuvalov (Inactive)
Participants:: Andrew Shuvalov, Lamont Nelson
Votes:: 0
Watchers:: 9

Created:: May 13 2021 03:04:01 PM UTC
Updated:: Oct 27 2023 01:52:23 PM UTC
Resolved:: Jul 16 2021 02:40:43 PM UTC
