[SERVER-57454] Chunk donor propagates NotMaster error from recipient back to MigrationManager, making it believe the donor is not primary Created: 04/Jun/21  Updated: 29/Oct/23  Resolved: 16/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.25
Fix Version/s: 4.0.26

Type: Bug Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: Sharding 2021-06-14, Sharding 2021-06-28, Sharding 2021-07-26
Participants:
Linked BF Score: 0

 Description   

Several places in the code, e.g. RemoteCommandTargeterRS, use a NotMaster error received from a remote replica set to mark the currently believed primary as no longer primary. This is a good optimization and it makes sense, except when the NotMaster error was propagated through a logical chain from yet another replica set.
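The optimization described above can be sketched roughly as follows. This is a simplified Python model, not the server's C++ code: `ReplicaSetView` and `update_on_response` are hypothetical names standing in for the replica set monitor and the targeter, and only the "NotMaster" code name is taken from the logs in this ticket.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

NOT_MASTER = "NotMaster"  # error code name as it appears in the logs


@dataclass
class ReplicaSetView:
    """Hypothetical stand-in for a monitor's view of one replica set."""
    name: str
    believed_primary: Optional[str]
    failed_hosts: Set[str] = field(default_factory=set)

    def failed_host(self, host: str) -> None:
        # Forget the host as primary; a later scan has to rediscover it.
        self.failed_hosts.add(host)
        if self.believed_primary == host:
            self.believed_primary = None


def update_on_response(view: ReplicaSetView, host: str,
                       error_code: Optional[str]) -> None:
    """Mark `host` as failed when its response carries NotMaster."""
    if error_code == NOT_MASTER:
        view.failed_host(host)
```

In this model the caller cannot tell whether the NotMaster code was produced by `host` itself or merely relayed by it, which is the gap this ticket describes.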

In this particular scenario the chain is "MigrationManager -> rs1 -> rs0". rs1 has stale information about the primary of rs0, so the "moveChunk" request it sends to rs0 fails with NotMaster, and the error is propagated to MigrationManager. The config server then invokes "failedHost()" on the RSM for rs1 and marks it as failed. It then takes time before Config Manager's RSM resolves rs1 again. In the failed test scenario that delay was enough for the test to time out.
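A minimal Python sketch of the chain, and of one way the donor could avoid leaking the recipient's error, is given here. All function names are hypothetical, and the use of "OperationFailed" as the replacement code is an assumption made for illustration; the actual fix is in the commit linked in the comments and may differ.

```python
NOT_MASTER = "NotMaster"
OPERATION_FAILED = "OperationFailed"  # assumed replacement code


class CommandError(Exception):
    """Hypothetical error carrying a code name and a reason string."""

    def __init__(self, code: str, reason: str) -> None:
        super().__init__(f"{code}: {reason}")
        self.code = code
        self.reason = reason


def recipient_handle_request(recipient_is_primary: bool) -> None:
    # rs0: rejects the request when the targeted node is not primary.
    if not recipient_is_primary:
        raise CommandError(NOT_MASTER, "not master")


def donor_move_chunk(recipient_is_primary: bool, remap: bool) -> None:
    # rs1: forwards the request to rs0 and reports the outcome upstream.
    try:
        recipient_handle_request(recipient_is_primary)
    except CommandError as err:
        if remap and err.code == NOT_MASTER:
            # Keep the cause in the message, but change the code so the
            # caller does not conclude that the donor lost its primary.
            raise CommandError(
                OPERATION_FAILED,
                f"Chunk move failed :: caused by :: {err.code}: {err.reason}",
            ) from err
        raise


def config_server_marks_donor_failed(recipient_is_primary: bool,
                                     remap: bool) -> bool:
    # Config server: treats NotMaster from the donor as "donor not primary".
    try:
        donor_move_chunk(recipient_is_primary, remap)
    except CommandError as err:
        return err.code == NOT_MASTER
    return False
```

Without remapping, `config_server_marks_donor_failed(False, remap=False)` returns True even though the donor is a healthy primary; with remapping it returns False and the migration simply fails with a different code.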

In the BF, rs0 is { 21021, 21022 }, rs1 is { 21023, 21024 }, and the config server is { 21025, ... }. The relevant logs:

2021-06-03T14:16:27.687+0000 d21022| 2021-06-03T14:16:27.687+0000 I REPL     [rsSync-0] transition to primary complete; database writes are now permitted
 
2021-06-03T14:16:46.516+0000 I SHARDING [conn23] Starting chunk migration ns: test.change_stream_failover, [{ _id: 50.0 }, { _id: MaxKey }), fromShard: change_stream_shard_failover-rs1, toShard: change_stream_shard_failover-rs0 with expected collection version epoch 60b8e43e5847308423cf68a4
 
d21023| 2021-06-03T14:16:46.536+0000 W SHARDING [conn23] Chunk move failed :: caused by :: NotMaster: not master
 
c21025| 2021-06-03T14:16:46.542+0000 I NETWORK  [ShardRegistry] Marking host ip-10-122-75-111:21023 as no longer a primary :: caused by :: NotMaster: not master



 Comments   
Comment by Githook User [ 14/Jun/21 ]

Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb)

Message: SERVER-57454: NotMaster error from chunk recipient is not propagated back to config server
Branch: v4.0
https://github.com/mongodb/mongo/commit/593b212f742775144c921fe62b0f3ddfb7125e86

Comment by Andrew Shuvalov (Inactive) [ 09/Jun/21 ]

This is a minor performance bug. It also makes the logs very confusing during an investigation, because the NotMaster error appears to come from the wrong server. I think porting it to head would be sufficient for now.

Generated at Thu Feb 08 05:41:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.