- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Fully Compatible
- ALL
- Server Serverless 2023-04-17, Server Serverless 2023-05-01, Server Serverless 2023-05-15, Server Serverless 2023-05-29, Server Serverless 2023-06-12, Server Serverless 2023-06-26
- 5
In the BF, after a donor failover, we see that the new donor primary tries to wait for all recipient nodes to reach an opTime even though the recipient nodes have already installed the split config. That wait never completes, so the split times out and aborts with "ErrorCodes: ExceededTimeLimit", which the test suite (shard_split_stepdown_jscore_passthrough) isn't expecting.
We should either (a) fix the problem by adding a marker to the donor state document once all recipient nodes have caught up to blockTS (i.e., something here), which can then be used to decide whether to skip the "waiting for BlockTS" stage, or (b) make the suites that combine step down with shard split (shard_split_kill_primary_jscore_passthrough, shard_split_terminate_primary_jscore_passthrough, shard_split_stepdown_jscore_passthrough) ignore such "ErrorCodes: ExceededTimeLimit" errors.
Note, however, that the shard split scope lists the following goal as completed:
Be resilient to failover including elections, node restarts, and transient network errors.
If we go with the latter solution, we should inform Cloud that shard split is not completely resilient to failovers, make a note in the scope document, and update the arch guide if needed.
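
For illustration, the first option (a marker in the donor state document) could look roughly like the sketch below. This is a minimal Python model, not the server implementation: the field name `recipients_caught_up`, the function `run_wait_for_block_ts_stage`, and the `wait_for_recipients` / `persist` callbacks are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DonorStateDoc:
    # Hypothetical marker: set (and persisted) once every recipient node
    # has caught up to blockTS.
    recipients_caught_up: bool = False


def run_wait_for_block_ts_stage(
    doc: DonorStateDoc,
    wait_for_recipients: Callable[[], None],
    persist: Callable[[DonorStateDoc], None],
) -> bool:
    """Run the "waiting for BlockTS" stage unless the marker says it is done.

    Returns True if the wait was performed, False if it was skipped.
    """
    if doc.recipients_caught_up:
        # A previous donor primary already completed this stage, so a new
        # primary must not wait again on nodes that have left the replica set.
        return False

    wait_for_recipients()            # may raise on timeout
    doc.recipients_caught_up = True
    persist(doc)                     # make the marker durable before moving on
    return True
```

In this model, a donor primary elected after the marker was persisted skips the wait entirely, so it never blocks on recipient nodes that have already installed the split config.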