[SERVER-56950] Avoid shardRegistry reload infinite loop when overlapping with setFCV Created: 14/May/21 Updated: 29/Oct/23 Resolved: 26/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.0, 4.9.0 |
| Fix Version/s: | 5.0.0-rc1, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pierlauro Sciarelli | Assignee: | Simon Gratzer (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | post-rc0, sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||
| Steps To Reproduce: |
1. ShardRegistry::_periodicReload causes a reload to occur. ShardRegistry::_getDataAsync advances the ReadThroughCache's timeInStore to some t1 with non-zero topologyTime. ReadThroughCache::acquireAsync creates an inProgressLookup with t1, and adds a promise for it to inProgressLookup._outstanding.
2. ShardRegistry::_lookup starts running, and meanwhile the test runs setFCV from 4.9 to 4.4.
3. ShardRegistryData::createFromCatalogClient returns. useActualTopologyTime() is false, so it returns the cached data's topology time (i.e. Timestamp(0,0), since this is the first reload) as result.t.
4. Inside ReadThroughCache::_doLookupWhileNotValid, inProgressLookup.getPromisesLessThanTime returns nothing, because the first promise in _outstanding is the promise for t1, which has non-zero topologyTime (i.e. t1 > result.t), so the for loop breaks early.
5. promisesToSet is empty, so mustDoAnotherLoop is true. The _inProgressLookup for t1 remains in the cache and another round of lookup starts; again no promises can be fulfilled, because of step 4.
6. Future reloads join this infinitely looping inProgressLookup. (That is why the hang analyzer output shows multiple mongo::ShardRegistry::_periodicReload threads.) |
||||||||||||||||
| Sprint: | Sharding EMEA 2021-05-31 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 170 | ||||||||||||||||
| Description |
|
When setFCV(v4.4) overlaps with a ShardRegistry reload - right after the useActualTopologyTime check - the ShardRegistry can fall into an infinite loop of lookups because the topology time is not gossiped after the setFCV succeeds. The purpose of this ticket is to prevent this overlap from resulting in a livelock. |
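The livelock described in the steps to reproduce can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the server's code: `Time`, `InProgressLookupModel`, `takeWaitersUpTo` and `runLookupRounds` are hypothetical names invented here, the real ReadThroughCache loop has more conditions than this, and the `maxRounds` bound exists only so the example terminates.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Simplified stand-in for a (seconds, increment) timestamp.
// Hypothetical: this is not MongoDB's Timestamp class.
using Time = std::pair<std::uint32_t, std::uint32_t>;

// Toy model of an in-progress lookup: waiters keyed by the minimum
// time they require, kept in ascending order of that time.
struct InProgressLookupModel {
    std::multimap<Time, int> outstanding;  // required time -> waiter id

    // Removes and returns the waiters whose required time is <= `time`.
    // The scan stops at the first waiter that needs a newer time.
    std::vector<int> takeWaitersUpTo(Time time) {
        std::vector<int> ready;
        auto it = outstanding.begin();
        while (it != outstanding.end() && it->first <= time) {
            ready.push_back(it->second);
            it = outstanding.erase(it);
        }
        return ready;
    }
};

// Runs lookup rounds until no waiter is left or `maxRounds` is reached.
// `lookupResultTime` models the time reported by every lookup round.
// Returns the number of rounds executed, or -1 if waiters were still
// outstanding after `maxRounds` (the livelock case).
int runLookupRounds(InProgressLookupModel& lookup, Time lookupResultTime, int maxRounds) {
    for (int round = 1; round <= maxRounds; ++round) {
        lookup.takeWaitersUpTo(lookupResultTime);
        if (lookup.outstanding.empty()) {
            return round;
        }
    }
    return -1;
}
```

In this model, a waiter registered for a non-zero time such as `Time{5, 1}` is never released while every round reports `Time{0, 0}`, so `runLookupRounds` exhausts its round budget. If a round reports a time at or beyond `Time{5, 1}`, the waiter is released in the first round.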
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 01/Jun/21 ] |
|
Author: {'name': 'Simon Gratzer', 'email': 'simon.gratzer@mongodb.com'}Message: |
| Comment by Githook User [ 26/May/21 ] |
|
Author: {'name': 'Simon Gratzer', 'email': 'simon.gratzer@mongodb.com'}Message: This reverts commit c6ebe28e7ed60bdb8675204144bbb765891a4ca2. |
| Comment by Githook User [ 25/May/21 ] |
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}Message: Revert " This reverts commit 5ffdb69a0d691549c0d6cd780c2d8be238e588a6. |
| Comment by Githook User [ 21/May/21 ] |
|
Author: {'name': 'Simon Gratzer', 'email': 'simon.gratzer@mongodb.com'}Message: |
| Comment by Simon Gratzer (Inactive) [ 21/May/21 ] |