[SERVER-38580] Tighten the check around when the `refreshing` flag is supposed to have been cleared on a secondary node Created: 12/Dec/18  Updated: 06/Dec/22  Resolved: 18/Feb/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Sharding EMEA
Sprint: Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25, Sharding 2019-05-06, Sharding 2019-05-20
Participants:
Linked BF Score: 8

 Description   

In a failed Evergreen run, we saw a case where the 'refreshing' flag was never unset on the secondaries, even though the refresh succeeded on the primaries.

It is extremely unlikely that the two writes in persistCollectionAndChangedChunks failed, and short of that, there is no explanation for why the notification was never signaled.

We should change the wait loop in _getCompletePersistedMetadataForSecondarySinceVersion to use a timeout of a few seconds on each iteration. After each timeout, it should check whether the optime returned by the primary's refresh has been reached and, if so, assert that the refreshing flag has been cleared. If the flag has not been cleared, it should log the contents of the config.cache.collections entry for the collection being refreshed and fail the refresh (which will cause the client calls to fail).
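The proposed loop could look roughly like the sketch below. It is a standalone illustration using only the C++ standard library, not the actual server code: CacheCollectionsEntry, SecondaryState, waitForRefreshToComplete, and the integer optimes are all hypothetical stand-ins, and the real implementation would use the server's own notification, optime, and logging facilities.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical, simplified view of a config.cache.collections entry.
struct CacheCollectionsEntry {
    std::string ns;
    bool refreshing = true;
};

// Hypothetical secondary-side state. Optimes are modeled as plain integers.
struct SecondaryState {
    std::mutex mutex;
    std::condition_variable notification;  // signaled when the entry changes
    CacheCollectionsEntry entry;
    std::uint64_t lastAppliedOpTime = 0;
};

// Waits for the 'refreshing' flag to clear, waking up every `interval`.
// On each timeout, if the secondary has already reached the optime returned
// by the primary's refresh but the flag is still set, logs the entry and
// fails the refresh by throwing.
void waitForRefreshToComplete(SecondaryState& state,
                              std::uint64_t primaryRefreshOpTime,
                              std::chrono::milliseconds interval) {
    std::unique_lock<std::mutex> lock(state.mutex);
    while (state.entry.refreshing) {
        // wait_for returns the predicate's value: true once the flag clears.
        const bool cleared = state.notification.wait_for(
            lock, interval, [&] { return !state.entry.refreshing; });
        if (cleared) {
            return;
        }
        // Timed out. If the primary's optime has been reached, the flag
        // should have been cleared by now.
        if (state.lastAppliedOpTime >= primaryRefreshOpTime) {
            std::cerr << "refreshing flag still set for " << state.entry.ns
                      << " at optime " << state.lastAppliedOpTime
                      << " (primary refresh optime " << primaryRefreshOpTime
                      << ")\n";
            throw std::runtime_error(
                "refreshing flag not cleared after reaching primary optime");
        }
        // Otherwise the secondary is still catching up; keep waiting.
    }
}
```

With an interval of a few seconds, a secondary that is merely lagging keeps waiting as before, while one that has caught up without clearing the flag fails quickly and leaves the entry contents in the log.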

This should help us build further hypotheses about this issue.

