Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.0-rc1, 5.1.0-rc0
Affects Version/s: 5.0.0, 4.9.0
Component/s: None
Labels:
- post-rc0
- sharding-wfbf-day

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v5.0
Steps To Reproduce:

1. ShardRegistry::_periodicReload causes a reload to occur. ShardRegistry::_getDataAsync advances the ReadThroughCache's timeInStore to some t1 with non-zero topologyTime. ReadThroughCache::acquireAsync creates an inProgressLookup with t1, and adds a promise for it to inProgressLookup._outstanding.

2. ShardRegistry::_lookup starts running, and meanwhile the test runs setFCV from 4.9 to 4.4.

3. ShardRegistryData::createFromCatalogClient returns. useActualTopologyTime() is false so it returns the cached data's topology time (i.e. Timestamp(0,0) since this is the first reload) as result.t.

4. Inside ReadThroughCache::_doLookupWhileNotValid, inProgressLookup.getPromisesLessThanTime returns nothing because the first promise in _outstanding is the promise for t1, which has non-zero topologyTime (i.e. t1 > result.t), so the for loop breaks early.

5. promisesToSet is empty, so mustDoAnotherLoop is true. The _inProgressLookup for t1 remains in the cache and another round of lookup starts; again no promises can be fulfilled, for the reason given in step 4.

6. Future reloads join this infinitely looping inProgressLookup. (That is why the hang analyzer output shows multiple mongo::ShardRegistry::_periodicReload threads.)
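
The loop described in steps 4 to 6 can be sketched with a toy model. This is not the MongoDB implementation: ToyInProgressLookup, takeSatisfied and runLookupRounds are hypothetical names, times are plain integers rather than Timestamp values, and the round cap exists only so that the example terminates. It assumes that a waiter is fulfilled only when the lookup result's time is at least the time the waiter asked for.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a topology time; not MongoDB's Timestamp class.
using Time = std::uint64_t;

// Hypothetical, simplified model of an in-progress lookup.
struct ToyInProgressLookup {
    // Times that waiters asked for, in ascending order (models _outstanding).
    std::vector<Time> outstanding;

    // Removes and returns the waiters whose requested time is <= resultTime.
    // Stops at the first waiter that asked for a newer time (the early break
    // from step 4).
    std::vector<Time> takeSatisfied(Time resultTime) {
        std::vector<Time> satisfied;
        auto it = outstanding.begin();
        for (; it != outstanding.end(); ++it) {
            if (*it > resultTime) {
                break;
            }
            satisfied.push_back(*it);
        }
        outstanding.erase(outstanding.begin(), it);
        return satisfied;
    }
};

// Runs lookup rounds until every waiter is satisfied or maxRounds is reached.
// Every round yields the same resultTime, as in step 3 where the cached
// topology time is returned. Returns the number of rounds executed.
int runLookupRounds(ToyInProgressLookup& lookup, Time resultTime, int maxRounds) {
    int rounds = 0;
    while (!lookup.outstanding.empty() && rounds < maxRounds) {
        ++rounds;
        lookup.takeSatisfied(resultTime);
    }
    return rounds;
}
```

With a waiter at time 5 and a lookup that keeps returning time 0, the model uses up every allowed round without fulfilling anything, which corresponds to the livelock; if the lookup returns time 5, a single round is enough.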

Sprint:
Sharding EMEA 2021-05-31
Linked BF Score:
170
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When setFCV(v4.4) overlaps with a ShardRegistry reload, right after the useActualTopologyTime check, the ShardRegistry can fall into an infinite loop of lookups because the topology time is no longer gossiped once the setFCV succeeds.

The purpose of this ticket is to prevent this overlap from resulting in a livelock.

is depended on by

SERVER-57017 Enable sharded DDL plus FCV FSM in stepdown suites

Closed

Assignee: Simon Gratzer (Inactive)
Reporter: Pierlauro Sciarelli
Participants: Githook User, Pierlauro Sciarelli, Simon Gratzer, Vivian Ge
Votes: 0
Watchers: 5

Created: May 14 2021 11:52:40 AM UTC
Updated: Jul 16 2024 04:07:37 PM UTC
Resolved: May 26 2021 02:43:16 PM UTC
Confidence Status Last Update: 21/May/21 8:13 AM
