- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 4.0.25
- Component/s: None
- None
- Fully Compatible
- ALL
- v4.0
- Sharding 2021-06-14, Sharding 2021-06-28, Sharding 2021-07-26
- 0
Several places in the code, e.g. RemoteCommandTargeterRS, react to a NotMaster error received from a remote replica set by marking the currently believed primary as no longer primary. This is a sensible optimization, except when the NotMaster error was propagated along a logical chain from yet another replica set.
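For illustration, here is a minimal, self-contained C++ sketch of that optimization. The names (Targeter, ReplicaSetMonitorStub, Status) are stand-ins invented for this ticket, not the actual server classes. When the contacted host itself reports NotMaster, invalidating the cached primary is exactly the right reaction:

// Minimal sketch (not the actual MongoDB implementation): how a replica-set
// targeter typically reacts to a NotMaster status from a remote command.
#include <iostream>
#include <string>

enum class ErrorCode { OK, NotMaster };

struct Status {
    ErrorCode code;
    std::string reason;
    bool isNotMaster() const { return code == ErrorCode::NotMaster; }
};

// Stand-in for the ReplicaSetMonitor of one replica set.
struct ReplicaSetMonitorStub {
    std::string setName;
    std::string believedPrimary;

    void failedHost(const std::string& host, const Status& status) {
        std::cout << "Marking host " << host
                  << " as no longer a primary :: caused by :: " << status.reason << "\n";
        believedPrimary.clear();  // forces a (potentially slow) re-resolution of the primary
    }
};

// Stand-in for RemoteCommandTargeterRS: on NotMaster it invalidates the cached primary.
struct Targeter {
    ReplicaSetMonitorStub& rsm;

    void updateHostWithStatus(const std::string& host, const Status& status) {
        if (status.isNotMaster()) {
            // The optimization: assume the contacted host itself reported that it
            // is not primary, so drop it from the cached topology immediately.
            rsm.failedHost(host, status);
        }
    }
};

int main() {
    ReplicaSetMonitorStub rs1{"rs1", "ip-10-122-75-111:21023"};
    Targeter targeter{rs1};

    // A NotMaster that genuinely came from rs1's own primary: invalidation is correct.
    targeter.updateHostWithStatus("ip-10-122-75-111:21023",
                                  {ErrorCode::NotMaster, "NotMaster: not master"});
    return 0;
}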
In this particular scenario the chain is "MigrationManager -> rs1 -> rs0": rs1 has stale information about the primary of rs0, so the "moveChunk" request it sends to rs0 fails with NotMaster, and that error is propagated back to the MigrationManager. The config server then invokes "failedHost()" on its RSM for rs1 and marks rs1's primary as failed, and it takes time before the config server's RSM resolves rs1 again. In the failed test that delay was enough for the test to time out.
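The sketch below shows how that chain misfires; again the names are invented stand-ins, and the guard shown is only a hypothetical mitigation, not necessarily the fix that was committed. The NotMaster produced by rs0's stale primary escapes rs1 verbatim, so the config server's targeter for rs1 treats it as if rs1's own primary had stepped down:

// Minimal sketch of the misfire described above (illustrative names only):
// rs0's NotMaster is returned verbatim through rs1, and the config server's
// targeter for rs1 then invalidates rs1's perfectly healthy primary.
#include <iostream>
#include <string>

enum class ErrorCode { OK, NotMaster, OperationFailed };

struct Status {
    ErrorCode code;
    std::string reason;
};

// rs0's stale primary rejects the forwarded request.
Status rs0HandleRecipientCommand() {
    return {ErrorCode::NotMaster, "NotMaster: not master"};
}

// rs1 (the donor shard) forwards part of the moveChunk to rs0 and, in this
// sketch, simply propagates whatever status it got back.
Status rs1HandleMoveChunk(bool rewriteForeignNotMaster) {
    Status fromRs0 = rs0HandleRecipientCommand();
    if (rewriteForeignNotMaster && fromRs0.code == ErrorCode::NotMaster) {
        // One possible guard (an assumption, not the actual fix): never let a
        // NotMaster that originated on *another* replica set escape verbatim.
        return {ErrorCode::OperationFailed,
                "Chunk move failed :: caused by :: " + fromRs0.reason};
    }
    return fromRs0;
}

int main() {
    // Without the guard: the config server sees NotMaster from rs1 and would
    // call failedHost() on rs1's primary, even though rs1's primary is fine.
    Status naive = rs1HandleMoveChunk(false);
    std::cout << "Propagated verbatim -> config marks rs1 primary failed: "
              << naive.reason << "\n";

    // With the guard: the error no longer looks like rs1's own NotMaster.
    Status guarded = rs1HandleMoveChunk(true);
    std::cout << "Rewritten at the boundary -> rs1 primary stays trusted: "
              << guarded.reason << "\n";
    return 0;
}

Rewriting the error at the replica-set boundary would keep the invalidation optimization for genuine step-downs while preventing a healthy primary from being marked failed because of another set's election.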
In the BF, rs0 is { 21021, 21022 }, rs1 is { 21023, 21024 }, and the config server is { 21025, ... }. The relevant logs:
2021-06-03T14:16:27.687+0000 d21022| 2021-06-03T14:16:27.687+0000 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
2021-06-03T14:16:46.516+0000 I SHARDING [conn23] Starting chunk migration ns: test.change_stream_failover, [{ _id: 50.0 }, { _id: MaxKey }), fromShard: change_stream_shard_failover-rs1, toShard: change_stream_shard_failover-rs0 with expected collection version epoch 60b8e43e5847308423cf68a4
d21023| 2021-06-03T14:16:46.536+0000 W SHARDING [conn23] Chunk move failed :: caused by :: NotMaster: not master
c21025| 2021-06-03T14:16:46.542+0000 I NETWORK [ShardRegistry] Marking host ip-10-122-75-111:21023 as no longer a primary :: caused by :: NotMaster: not master