SERVER-56917

Stuck Hello request may lead to cluster outage

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • None
    • Affects Version/s: 5.0.0
    • Component/s: None
    • ALL
    • Steps to reproduce:

      Apply diff and run:

      buildscripts/resmoke.py run --suite=sharding_hello_failures jstests/concurrency/fsm_workloads/update_array.js



    • Sharding 2021-07-12, Sharding 2021-07-26


      We had an outage in a 4.0 cluster when a hardware/OS failure on one shard's primary server manifested as a stuck Hello request and resulted in a sharded cluster outage. The culprit was the RSM (ReplicaSetMonitor): its thread was blocked, it accumulated unprocessed requests, and it eventually became unresponsive for all shards.
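
As a rough illustration of that failure mode, here is a toy Python model of a single worker thread that blocks on one stuck request while requests for other shards queue up behind it. This is not the RSM code: `run_demo`, the host names, and the use of a plain queue are all assumptions made for the sketch.

```python
import queue
import threading


def run_demo():
    """Toy model: one worker thread, one stuck request, a growing backlog."""
    requests = queue.Queue()
    stuck = threading.Event()    # stays unset while the simulated fault lasts
    started = threading.Event()  # set once the worker picks up the stuck request
    processed = []

    def worker():
        while True:
            host = requests.get()
            if host is None:  # sentinel: stop the worker
                requests.task_done()
                return
            if host == "shardA-primary":  # hypothetical faulty host
                started.set()
                stuck.wait()  # models a Hello call that never returns
            processed.append(host)
            requests.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    requests.put("shardA-primary")
    started.wait()  # the worker is now blocked on the stuck request
    for host in ("shardB-primary", "shardC-primary", "shardD-primary"):
        requests.put(host)

    backlog = requests.qsize()  # healthy shards waiting behind the stuck one

    stuck.set()         # clear the simulated fault
    requests.put(None)
    requests.join()
    thread.join()
    return backlog, processed


if __name__ == "__main__":
    backlog, processed = run_demo()
    print("requests queued behind the stuck call:", backlog)
    print("processing order after recovery:", processed)
```

In this model the requests for the healthy shards are served only after the stuck call is released, which is how one bad server can stall monitoring for every shard.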

      To check for a similar vulnerability in other branches, the new pass-through test was also ported to the current head (5.0). There the test exposes a somewhat different, yet still critical, bug.

      Test results

      If the attached diff is applied, the following RSM outage is reproduced:

      • Failure injection is configured to delay the Hello response indefinitely at the primary
      • The primary is forced to step down and a new primary is forced to step up
      • Mongos is unable to detect the new primary because it enters an infinite loop:
          • The Hello request to the old (dysfunctional) primary fails with NetworkInterfaceExceededTimeLimit
          • RSM starts another Hello to the same server without even trying other servers
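
The loop described above can be sketched with a small Python model. It is a simplified illustration under stated assumptions, not the actual RSM logic: `find_primary`, `fake_hello`, the `rotate` flag, and the host names are hypothetical, and a Python `TimeoutError` stands in for NetworkInterfaceExceededTimeLimit. The only real name used is the `isWritablePrimary` field of the Hello reply.

```python
HOSTS = ["node1", "node2", "node3"]  # hypothetical replica set members


def fake_hello(host):
    """Stub Hello: node1 (old primary) always times out, node2 is the new primary."""
    if host == "node1":
        raise TimeoutError("Hello to node1 exceeded the time limit")
    return {"isWritablePrimary": host == "node2"}


def find_primary(hosts, hello, max_attempts, rotate):
    """Probe hosts until a primary is found; return (primary or None, probed hosts)."""
    probed = []
    index = 0
    for _ in range(max_attempts):
        host = hosts[index]
        probed.append(host)
        try:
            reply = hello(host)
        except TimeoutError:
            if rotate:
                # Move on to the next server after a timeout.
                index = (index + 1) % len(hosts)
            # With rotate=False the same server is retried, as in the report.
            continue
        if reply.get("isWritablePrimary"):
            return host, probed
        index = (index + 1) % len(hosts)
    return None, probed


if __name__ == "__main__":
    print(find_primary(HOSTS, fake_hello, max_attempts=5, rotate=False))
    print(find_primary(HOSTS, fake_hello, max_attempts=5, rotate=True))
```

With `rotate=False` every attempt goes to `node1` and no primary is found within the attempt budget; with `rotate=True` the second probe reaches `node2`. The attempt cap exists only to keep the demo finite.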

      Apparently all our branches are vulnerable to this bug in one way or another. This ticket is for 5.0, and the fix should be backported to at least 4.4. The fix for 4.0 and 4.2 is tracked separately in SERVER-56854, as the code and the fix are quite different.

            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)