Core Server / SERVER-57454

Chunk donor propagates NotMaster error from recipient back to MigrationManager, making it believe the donor is not primary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 4.0.25
    • Fix Version/s: 4.0.26
    • Component/s: None
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0
    • Sprint:
      Sharding 2021-06-14, Sharding 2021-06-28, Sharding 2021-07-26
    • Linked BF Score:
      38

      Description

      Several places in the code, e.g. RemoteCommandTargeterRS, use a NotMaster error received from a remote replica set to mark the currently believed primary as no longer primary. This is a good optimization and it makes sense, except when the NotMaster error originated in yet another replica set and was propagated back through a chain of calls.
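
      A minimal sketch of that optimization, for illustration only. ToyTargeter, NotMasterError and on_command_error are hypothetical names and do not reproduce the server's actual classes or signatures; the port numbers are the rs1 hosts from this ticket.

```python
class NotMasterError(Exception):
    """Stand-in for a NotMaster error response (hypothetical type)."""


class ToyTargeter:
    """Tracks which host of one replica set is currently believed primary."""

    def __init__(self, hosts, primary):
        self.hosts = list(hosts)
        self.primary = primary  # None means "unknown, must be rediscovered"

    def on_command_error(self, host, error):
        # The optimization: a NotMaster reply that came back via `host` is
        # treated as proof that `host` itself is no longer primary, so the
        # cached belief is dropped without waiting for the next refresh.
        if isinstance(error, NotMasterError) and host == self.primary:
            self.primary = None


targeter = ToyTargeter(hosts=[21023, 21024], primary=21023)
targeter.on_command_error(21023, NotMasterError("not master"))
primary_after_error = targeter.primary  # belief was dropped
```

      The sketch only checks the error type and the host the reply came from. It has no way to tell whether the error was produced by that host or merely relayed by it, which is the gap this ticket is about.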

      In this particular scenario the chain is "MigrationManager -> rs1 -> rs0". rs1 has stale information about the primary of rs0, so the "moveChunk" request it sends to rs0 fails with NotMaster, and that error is propagated back to MigrationManager. The config server then invokes "failedHost()" on the RSM for rs1 and marks it as failed. It then takes time before the config server's RSM resolves rs1 again. In the failed test scenario that delay was long enough for the test to time out.
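
      The chain can be modeled with a small simulation. This is a sketch under stated assumptions, not the server code: all function and class names are hypothetical, the choice of 21021 as the stale rs0 host is inferred from the logs (21022 is the node that became primary), and the remap option shows one possible way to keep the recipient's error from being attributed to the donor, not necessarily what the actual fix does.

```python
class NotMasterError(Exception):
    """Stand-in for a NotMaster error response (hypothetical type)."""


class OperationFailed(Exception):
    """Generic failure that does not imply anything about primaryship."""


class Node:
    def __init__(self, port, is_primary):
        self.port = port
        self.is_primary = is_primary


def recipient_start_migration(node):
    # rs0 side: only a primary can accept the incoming chunk.
    if not node.is_primary:
        raise NotMasterError(f"{node.port}: not master")


def donor_move_chunk(recipient_target, remap):
    # rs1 side: contacts the host it believes is the rs0 primary.
    try:
        recipient_start_migration(recipient_target)
    except NotMasterError as exc:
        if remap:
            # Assumed remedy: report the failure without the NotMaster code.
            raise OperationFailed(f"recipient failed: {exc}") from exc
        raise  # behavior described in this ticket: error passes through


def config_run_migration(donor, recipient_target, remap):
    """Return the rs1 primary the config server believes in afterwards."""
    believed_primary = donor.port
    try:
        donor_move_chunk(recipient_target, remap)
    except NotMasterError:
        believed_primary = None  # donor wrongly marked as not primary
    except OperationFailed:
        pass  # migration failed, but the donor is still trusted
    return believed_primary


donor = Node(21023, is_primary=True)          # actual rs1 primary
stale_rs0_host = Node(21021, is_primary=False)  # assumed stale target

propagated = config_run_migration(donor, stale_rs0_host, remap=False)
remapped = config_run_migration(donor, stale_rs0_host, remap=True)
```

      With remap=False the config server ends up with no known primary for rs1 even though 21023 never stepped down. With remap=True the migration still fails, but the config server keeps its belief about the donor.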

      In the BF, rs0 is { 21021, 21022 }, rs1 is { 21023, 21024 }, and the config server is { 21025, ... }. The relevant logs:

      2021-06-03T14:16:27.687+0000 d21022| 2021-06-03T14:16:27.687+0000 I REPL     [rsSync-0] transition to primary complete; database writes are now permitted
       
      2021-06-03T14:16:46.516+0000 I SHARDING [conn23] Starting chunk migration ns: test.change_stream_failover, [{ _id: 50.0 }, { _id: MaxKey }), fromShard: change_stream_shard_failover-rs1, toShard: change_stream_shard_failover-rs0 with expected collection version epoch 60b8e43e5847308423cf68a4
       
      d21023| 2021-06-03T14:16:46.536+0000 W SHARDING [conn23] Chunk move failed :: caused by :: NotMaster: not master
       
      c21025| 2021-06-03T14:16:46.542+0000 I NETWORK  [ShardRegistry] Marking host ip-10-122-75-111:21023 as no longer a primary :: caused by :: NotMaster: not master
      

            People

            Assignee:
            andrew.shuvalov Andrew Shuvalov
            Reporter:
            andrew.shuvalov Andrew Shuvalov