  1. Core Server
  2. SERVER-60344

Action plan on lagging setFCV replicas breaking tests

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 4.0.28
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Fully Compatible
    • ALL
    • Sharding 2021-10-04, Sharding 2021-10-18
    • 145

      This ticket is to discuss and decide on the proper actions for the test breakages investigated by cheahuychou.mao:

      1. setFeatureCompatibilityVersion() returns once the change is majority committed. This means that when the command succeeds, some replicas may still lag behind the desired version.
      2. When Mongos detects an incompatible version, it check-fails.
      3. The Mongos connection pre-warm feature amplifies this race: Mongos now pre-warms connections at startup, including connections to non-primaries. A replica that still lags on setFCV() can cause Mongos to crash.
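
      The race in points 1-3 can be illustrated with a toy model. This is only a sketch, not server code: `Node`, `set_fcv_majority`, and the version strings "old" and "new" are hypothetical names invented for illustration, and the model assumes the command returns as soon as a bare majority of nodes has applied the change.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    fcv: str  # feature compatibility version this node has applied so far


def set_fcv_majority(nodes, target):
    """Toy model: apply `target` to a bare majority of nodes, then report success."""
    majority = len(nodes) // 2 + 1
    for node in nodes[:majority]:
        node.fcv = target
    return True  # the command "succeeds" even though some nodes were not updated


def lagging(nodes, target):
    """Names of nodes that have not yet applied `target`."""
    return [n.name for n in nodes if n.fcv != target]


nodes = [
    Node("primary", "old"),
    Node("secondary-1", "old"),
    Node("secondary-2", "old"),
]
ok = set_fcv_majority(nodes, "new")
print(ok, lagging(nodes, "new"))
```

      In this model the command reports success while `secondary-2` is still on the old version, which is the window in which a freshly started Mongos could connect to it.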

      There are several questions we need to resolve before addressing this.

      Is majority commit correct for setFCV(), or do we want to replicate to all replicas first?

      • My opinion: it is correct. What an SRE wants from the system is that changing FCV is fast and durable. An SRE usually performs an upgrade or downgrade at a moment that is not a good time to reflect on the complications of setFCV not returning or timing out. What the SRE wants is to get the confirmation and move forward with the next actions.

      Is it correct that Mongos pre-warms connections to non-primaries?

      • My opinion: it might be tricky. Mongos doesn't know the primary; it gets the Grid::get() information and iterates over `shard.getHost()... getServers()` to pre-warm. Should it take additional wait time to find out just the primary, making this a 2-step procedure? Do we need to make the code that cumbersome?
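
      To make the trade-off concrete, here is a toy comparison of pre-warming every known host versus the primary only. It is a sketch under assumptions: `Host`, `prewarm_targets`, and `incompatible` are hypothetical and do not correspond to the actual Mongos code, and the model assumes the primary is already known, which is exactly the extra discovery step in question.

```python
from dataclasses import dataclass


@dataclass
class Host:
    address: str
    is_primary: bool
    fcv: str  # version this host currently reports


def prewarm_targets(hosts, primary_only):
    """Pick the hosts to pre-warm: everything known, or just the primary."""
    if primary_only:
        return [h for h in hosts if h.is_primary]
    return list(hosts)


def incompatible(targets, expected_fcv):
    """Addresses whose reported version would trip the compatibility check."""
    return [h.address for h in targets if h.fcv != expected_fcv]


hosts = [
    Host("primary:27017", True, "new"),
    Host("secondary-1:27017", False, "new"),
    Host("secondary-2:27017", False, "old"),  # still lagging on setFCV
]

all_hosts_hits = incompatible(prewarm_targets(hosts, primary_only=False), "new")
primary_only_hits = incompatible(prewarm_targets(hosts, primary_only=True), "new")
print(all_hosts_hits, primary_only_hits)
```

      Pre-warming all hosts reaches the lagging secondary, while the primary-only variant does not; the cost is that `is_primary` has to be known before pre-warming starts.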


      The proposed action:

      1. Make a best effort to find a simple solution for pre-warming connections to the primary only.
      2. If it gets unclean and/or requires extra overhead compared to the existing code, just fix the tests.
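
      For option 2, one possible shape of a test-side fix is to wait until every replica reports the target version before starting Mongos. The helper below is a hypothetical sketch, not an existing test utility: `wait_for_fcv_everywhere` and `read_fcvs` are invented names, and the replica states are simulated with canned snapshots.

```python
import time


def wait_for_fcv_everywhere(read_fcvs, target, timeout_s=5.0, poll_s=0.01):
    """Poll until every replica reports `target`; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all(version == target for version in read_fcvs()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Simulated replica set: the third node catches up on the third poll.
snapshots = iter([
    ["new", "new", "old"],
    ["new", "new", "old"],
    ["new", "new", "new"],
])
steady_state = ["new", "new", "new"]


def read_fcvs():
    return next(snapshots, steady_state)


caught_up = wait_for_fcv_everywhere(read_fcvs, "new")
print(caught_up)
```

      A test would call such a helper between the setFCV step and the Mongos start, so that pre-warming never sees a lagging replica.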

            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)