Core Server / SERVER-57454

Chunk donor propagates NotMaster error from recipient back to MigrationManager, making it believe the donor is not primary

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 4.0.26
    • Affects Version/s: 4.0.25
    • Component/s: None
    • None
    • Fully Compatible
    • ALL
    • v4.0
    • Sharding 2021-06-14, Sharding 2021-06-28, Sharding 2021-07-26
    • 0

      Several places in the code, e.g. RemoteCommandTargeterRS, use a NotMaster error received from a remote replica set to mark the currently believed primary of that set as no longer primary. This is a sensible optimization, except when the NotMaster error did not originate from that replica set but was propagated through it, along the logical chain, from yet another replica set.
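      The optimization described above can be sketched roughly as follows. This is a simplified Python model and not the server's C++ code: `ReplicaSetMonitorModel`, `TargeterModel`, `update_host_with_status`, and the host name are hypothetical and chosen only for illustration, and nothing beyond the NotMaster handling is modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NOT_MASTER = "NotMaster"


@dataclass
class ReplicaSetMonitorModel:
    """Hypothetical stand-in for a replica set monitor (RSM)."""
    set_name: str
    primary: Optional[str]
    failed_hosts: List[str] = field(default_factory=list)

    def failed_host(self, host: str) -> None:
        # Remember the failure and forget the primary if it was this host.
        self.failed_hosts.append(host)
        if self.primary == host:
            self.primary = None


@dataclass
class TargeterModel:
    """Hypothetical stand-in for a replica set command targeter."""
    rsm: ReplicaSetMonitorModel

    def update_host_with_status(self, host: str, error_code: str) -> None:
        # The optimization: a NotMaster reply is taken as proof that the
        # host we just talked to is no longer the primary of its set.
        if error_code == NOT_MASTER:
            self.rsm.failed_host(host)


rsm = ReplicaSetMonitorModel("rs1", primary="host:21023")
targeter = TargeterModel(rsm)
targeter.update_host_with_status("host:21023", NOT_MASTER)
print(rsm.primary)  # None: the primary must be rediscovered before reuse
```

      The model makes the implicit assumption visible: the error code alone is treated as a statement about the host that returned it, regardless of where the error actually originated.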

      In this particular scenario the chain is "MigrationManager -> rs1 -> rs0". rs1 has stale information about the primary of rs0, so the "moveChunk" request it sends to rs0 fails with NotMaster, and the error is propagated to MigrationManager. The config server then invokes "failedHost()" on its RSM for rs1 and marks the rs1 host as failed. It takes time before the config server's RSM resolves rs1 again. In the failed test scenario that delay was long enough for the test to time out.
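      The chain above can be modeled with a small, self-contained Python sketch. All function names are hypothetical, the port numbers assigned to the stale and actual rs0 primary are an assumption made for illustration, and the `wrap_remote_errors` switch shows one possible mitigation (re-labelling the recipient's error on the donor), which is not necessarily the fix that was applied for this ticket.

```python
class CommandError(Exception):
    """Minimal error type carrying a symbolic error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def recipient_rs0_handle_request(target_node, actual_primary):
    # rs0 rejects the request if it reaches a node that is not primary.
    if target_node != actual_primary:
        raise CommandError("NotMaster", "not master")


def donor_rs1_move_chunk(wrap_remote_errors):
    try:
        # Assumption: rs1 still targets 21021 while 21022 is now primary.
        recipient_rs0_handle_request(target_node="21021", actual_primary="21022")
    except CommandError as err:
        if wrap_remote_errors and err.code == "NotMaster":
            # Possible mitigation: report a generic failure so the caller
            # does not mistake the recipient's state for the donor's.
            raise CommandError(
                "OperationFailed", f"Chunk move failed :: caused by :: {err}"
            ) from err
        raise  # Buggy behavior: the recipient's NotMaster leaks through.


def migration_manager(wrap_remote_errors):
    """Return True if the donor (rs1) primary is still trusted afterwards."""
    donor_primary_trusted = True
    try:
        donor_rs1_move_chunk(wrap_remote_errors)
    except CommandError as err:
        if err.code == "NotMaster":
            # Models failedHost() being invoked on the RSM for rs1.
            donor_primary_trusted = False
    return donor_primary_trusted


print(migration_manager(wrap_remote_errors=False))  # False
print(migration_manager(wrap_remote_errors=True))   # True
```

      In the first call the donor's primary is marked as failed even though only the recipient's primary information was stale. In the second call the migration still fails, but the config server keeps its view of rs1.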

      In the BF rs0 is { 21021, 21022 }, rs1 is { 21023, 21024 }, config server { 21025, ... }. The relevant logs:

      2021-06-03T14:16:27.687+0000 d21022| 2021-06-03T14:16:27.687+0000 I REPL     [rsSync-0] transition to primary complete; database writes are now permitted
      
      2021-06-03T14:16:46.516+0000 I SHARDING [conn23] Starting chunk migration ns: test.change_stream_failover, [{ _id: 50.0 }, { _id: MaxKey }), fromShard: change_stream_shard_failover-rs1, toShard: change_stream_shard_failover-rs0 with expected collection version epoch 60b8e43e5847308423cf68a4
      
      d21023| 2021-06-03T14:16:46.536+0000 W SHARDING [conn23] Chunk move failed :: caused by :: NotMaster: not master
      
      c21025| 2021-06-03T14:16:46.542+0000 I NETWORK  [ShardRegistry] Marking host ip-10-122-75-111:21023 as no longer a primary :: caused by :: NotMaster: not master
      

            Assignee:
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)
            Reporter:
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)
            Votes:
            0
            Watchers:
            2
