[SERVER-65714] ShardSplitDonorServiceTest fails due to a race condition when responding to recipient node monitor scheduled requests Created: 15/Apr/22  Updated: 06/Dec/22  Resolved: 26/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mathis Bessa Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Won't Do Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-65218 Run ShardSplitDonorServiceTest with w... Closed
Assigned Teams:
Serverless
Operating System: ALL
Sprint: Server Serverless 2022-04-18
Participants:
Linked BF Score: 34

 Description   

There is currently a race condition in BasicShardSplitDonorServiceInstanceCreation: the test can start processing the scheduled requests when only one recipient node monitor has been created. The test expects all of the requests to be ready, but in this case only one is available, so an assertion fails when it tries to process the next one.

The solution is simply to start responding to scheduled requests only once all the recipient node monitors are created and all the requests are scheduled.

A solution would be to keep track of the hosts and hello requests so that no hello response is missed. We should make sure each host has responded at least once in order to validate all the expected hello requests.

  • Map the responded heartbeats to the hosts.
  • Keep processing heartbeats until all hosts have responded to at least one heartbeat request.
  • Stop processing heartbeats once all the hosts have responded at least once, or if the test has waited too long to receive a request.
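
The steps above could be sketched roughly as follows. This is an illustrative Python model only, not the actual C++ test code: the function name `process_until_all_hosts_respond`, its parameters, and the idea that a scheduled request can be reduced to the host name it targets are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Optional, Set, Tuple


def process_until_all_hosts_respond(
    expected_hosts: Iterable[str],
    next_request: Callable[[], Optional[str]],  # hypothetical: target host of the next scheduled request, or None
    respond: Callable[[str], None],             # hypothetical: answers one heartbeat for that host
    timeout_secs: float = 5.0,
    poll_interval_secs: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Set[str], bool]:
    """Respond to heartbeats until every expected host has been answered once.

    Returns the set of hosts that were answered and a flag that is True only
    if every expected host responded before the timeout.
    """
    pending = set(expected_hosts)  # hosts that have not responded yet
    responded: Set[str] = set()    # hosts mapped to at least one answered heartbeat
    deadline = clock() + timeout_secs

    while pending and clock() < deadline:
        host = next_request()
        if host is None:
            # Nothing is scheduled yet (for example, a monitor has not been
            # created). Wait instead of asserting that a request is ready.
            time.sleep(poll_interval_secs)
            continue
        respond(host)
        responded.add(host)
        pending.discard(host)

    return responded, not pending
```

The point of the sketch is that the loop never assumes a request is already available: it waits for one, and it stops either when the pending set is empty or when the deadline passes, so a test can fail with a clear timeout rather than on an assertion about a missing request.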


 Comments   
Comment by Matt Broadstone [ 26/May/22 ]

This ticket now tracks a suggested improvement to a shard split unit test. I'm closing this since we haven't seen any failures specific to the issue the improvement tries to address.
