- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 4.0.25
- Component/s: None
- None
- Fully Compatible
- ALL
- v4.0
- Sharding 2021-06-14, Sharding 2021-06-28, Sharding 2021-07-26
- 0
Several places in the code, e.g. RemoteCommandTargeterRS, react to a NotMaster error received from a remote replica set by marking the currently believed primary as no longer primary. This is a sensible optimization, except when the NotMaster error was propagated along a logical chain from yet another replica set.
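For illustration, here is a minimal, self-contained C++ sketch of that optimization. The names (Targeter, ReplicaSetMonitorStub, Status) are stand-ins invented for this ticket, not the actual server classes. When the contacted host itself reports NotMaster, invalidating the cached primary is exactly the right reaction:

// Minimal sketch (not the actual MongoDB implementation): how a replica-set
// targeter typically reacts to a NotMaster status from a remote command.
#include <iostream>
#include <string>

enum class ErrorCode { OK, NotMaster };

struct Status {
    ErrorCode code;
    std::string reason;
    bool isNotMaster() const { return code == ErrorCode::NotMaster; }
};

// Stand-in for the ReplicaSetMonitor of one replica set.
struct ReplicaSetMonitorStub {
    std::string setName;
    std::string believedPrimary;

    void failedHost(const std::string& host, const Status& status) {
        std::cout << "Marking host " << host
                  << " as no longer a primary :: caused by :: " << status.reason << "\n";
        believedPrimary.clear();  // forces a (potentially slow) re-resolution of the primary
    }
};

// Stand-in for RemoteCommandTargeterRS: on NotMaster it invalidates the cached primary.
struct Targeter {
    ReplicaSetMonitorStub& rsm;

    void updateHostWithStatus(const std::string& host, const Status& status) {
        if (status.isNotMaster()) {
            // The optimization: assume the contacted host itself reported that it
            // is not primary, so drop it from the cached topology immediately.
            rsm.failedHost(host, status);
        }
    }
};

int main() {
    ReplicaSetMonitorStub rs1{"rs1", "ip-10-122-75-111:21023"};
    Targeter targeter{rs1};

    // A NotMaster that genuinely came from rs1's own primary: invalidation is correct.
    targeter.updateHostWithStatus("ip-10-122-75-111:21023",
                                  {ErrorCode::NotMaster, "NotMaster: not master"});
    return 0;
}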
In this particular scenario the chain is "MigrationManager -> rs1 -> rs0": rs1 has stale information about the primary of rs0, so the "moveChunk" request it sends to rs0 fails with NotMaster, and that error is propagated back to the MigrationManager. The config server then invokes "failedHost()" on its RSM for rs1 and marks rs1's primary as failed, and it takes time before the config server's RSM resolves rs1 again. In the failed test that delay was enough for the test to time out.
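The sketch below shows how that chain misfires; again the names are invented stand-ins, and the guard shown is only a hypothetical mitigation, not necessarily the fix that was committed. The NotMaster produced by rs0's stale primary escapes rs1 verbatim, so the config server's targeter for rs1 treats it as if rs1's own primary had stepped down:

// Minimal sketch of the misfire described above (illustrative names only):
// rs0's NotMaster is returned verbatim through rs1, and the config server's
// targeter for rs1 then invalidates rs1's perfectly healthy primary.
#include <iostream>
#include <string>

enum class ErrorCode { OK, NotMaster, OperationFailed };

struct Status {
    ErrorCode code;
    std::string reason;
};

// rs0's stale primary rejects the forwarded request.
Status rs0HandleRecipientCommand() {
    return {ErrorCode::NotMaster, "NotMaster: not master"};
}

// rs1 (the donor shard) forwards part of the moveChunk to rs0 and, in this
// sketch, simply propagates whatever status it got back.
Status rs1HandleMoveChunk(bool rewriteForeignNotMaster) {
    Status fromRs0 = rs0HandleRecipientCommand();
    if (rewriteForeignNotMaster && fromRs0.code == ErrorCode::NotMaster) {
        // One possible guard (an assumption, not the actual fix): never let a
        // NotMaster that originated on *another* replica set escape verbatim.
        return {ErrorCode::OperationFailed,
                "Chunk move failed :: caused by :: " + fromRs0.reason};
    }
    return fromRs0;
}

int main() {
    // Without the guard: the config server sees NotMaster from rs1 and would
    // call failedHost() on rs1's primary, even though rs1's primary is fine.
    Status naive = rs1HandleMoveChunk(false);
    std::cout << "Propagated verbatim -> config marks rs1 primary failed: "
              << naive.reason << "\n";

    // With the guard: the error no longer looks like rs1's own NotMaster.
    Status guarded = rs1HandleMoveChunk(true);
    std::cout << "Rewritten at the boundary -> rs1 primary stays trusted: "
              << guarded.reason << "\n";
    return 0;
}

Rewriting the error at the replica-set boundary would keep the invalidation optimization for genuine step-downs while preventing a healthy primary from being marked failed because of another set's election.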
In the BF, rs0 is { 21021, 21022 }, rs1 is { 21023, 21024 }, and the config server is { 21025, ... }. The relevant logs:
2021-06-03T14:16:27.687+0000 d21022| 2021-06-03T14:16:27.687+0000 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
2021-06-03T14:16:46.516+0000 I SHARDING [conn23] Starting chunk migration ns: test.change_stream_failover, [{ _id: 50.0 }, { _id: MaxKey }), fromShard: change_stream_shard_failover-rs1, toShard: change_stream_shard_failover-rs0 with expected collection version epoch 60b8e43e5847308423cf68a4
d21023| 2021-06-03T14:16:46.536+0000 W SHARDING [conn23] Chunk move failed :: caused by :: NotMaster: not master
c21025| 2021-06-03T14:16:46.542+0000 I NETWORK [ShardRegistry] Marking host ip-10-122-75-111:21023 as no longer a primary :: caused by :: NotMaster: not master