[SERVER-75315] Shard split donor tries to wait on an opTime for all recipient nodes to reach after the recipient nodes have installed split config. Created: 27/Mar/23  Updated: 29/Oct/23  Resolved: 23/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Matt Broadstone
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Server Serverless 2023-04-17, Server Serverless 2023-05-01, Server Serverless 2023-05-15, Server Serverless 2023-05-29, Server Serverless 2023-06-12, Server Serverless 2023-06-26
Participants:
Linked BF Score: 5

 Description   

In the BF, after a donor failover, the new donor primary tries to wait for all recipient nodes to reach an opTime after the recipient nodes have already installed the split config. That wait times out and the split aborts with "ErrorCodes: ExceededTimeLimit", which the test suite (shard_split_stepdown_jscore_passthrough) isn't expecting.

We should either fix the problem by adding a marker to the donor state document once all recipient nodes have caught up to blockTS (i.e., something here), which can then be used to decide whether to skip the "waiting for blockTS" stage, or make the suites that combine step down with shard split (shard_split_kill_primary_jscore_passthrough, shard_split_terminate_primary_jscore_passthrough, shard_split_stepdown_jscore_passthrough) ignore such "ErrorCodes: ExceededTimeLimit" errors.

Note, however, that the shard split scope lists this goal as completed:

Be resilient to failover including elections, node restarts, and transient network errors.

If we go with the latter solution, we should inform Cloud that shard split is not completely resilient to failovers, make a note in the scope document, and update the arch guide if needed.
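
To illustrate the first option (a marker in the donor state document), here is a minimal Python sketch. It is not the server's implementation: `DonorStateDoc`, `recipients_caught_up`, `run_split`, `SplitTimeout`, and the two callbacks are hypothetical names made up for this example. It only shows how a durably recorded marker could let a new donor primary skip the "waiting for blockTS" stage after a failover.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DonorStateDoc:
    """Toy stand-in for the persisted donor state document (hypothetical fields)."""
    block_optime: int
    # Hypothetical marker: set once every recipient has reached block_optime.
    recipients_caught_up: bool = False


class SplitTimeout(Exception):
    """Stands in for ErrorCodes: ExceededTimeLimit in this sketch."""


def run_split(doc: DonorStateDoc,
              wait_for_recipients: Callable[[int], bool],
              persist: Callable[[DonorStateDoc], None]) -> List[str]:
    """Run (or resume) the donor-side steps and return the steps actually taken."""
    steps = []
    if not doc.recipients_caught_up:
        # First attempt: wait, then record the marker before moving on.
        if not wait_for_recipients(doc.block_optime):
            raise SplitTimeout("recipients did not reach block opTime in time")
        steps.append("waited_for_block_optime")
        doc.recipients_caught_up = True
        persist(doc)
    # A new primary that reads the marker after a failover skips the wait.
    steps.append("apply_split_config")
    return steps
```

With this shape, a resumed split whose marker is already set never calls `wait_for_recipients`, so it cannot hit the timeout described above even if the recipients have already installed the split config.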



 Comments   
Comment by Githook User [ 23/Jun/23 ]

Author:

{'name': 'Matt Broadstone', 'email': 'mbroadst@mongodb.com', 'username': 'mbroadst'}

Message: SERVER-75315 Record when all recipients reach block opTime
Branch: master
https://github.com/mongodb/mongo/commit/1622e1948026489a99e4f34287dcd3cd2ada0039

Comment by Didier Nadeau [ 27/Mar/23 ]

suganthi.mani@mongodb.com could you write a description of the bug please?

Generated at Thu Feb 08 06:29:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.