[SERVER-54959] Prevent cursors established by ReshardingCollectionCloner from being timed out while in use on some shards Created: 04/Mar/21 Updated: 29/Oct/23 Resolved: 22/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-234-M2.5, PM-234-T-data-clone | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||
| Sprint: | Sharding 2021-03-22, Sharding 2021-04-05, Sharding 2021-04-19, Sharding 2021-05-03 | ||||
| Participants: | |||||
| Story Points: | 2 | ||||
| Description |
|
A shard could have only a small number of documents meant for a particular recipient and therefore return batches infrequently. If a batch isn't returned within 10 minutes, the cursors that are idle on the other donor shards will time out. We should therefore set noCursorTimeout on the aggregation request sent to all donor shards in the ReshardingCollectionCloner. |
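The timeout problem described above can be illustrated with a small toy model. This is only a sketch and not the server's cursor manager: the `DonorCursor` class, the `timed_out` helper, and the shard names are hypothetical, and the `no_timeout` field merely stands in for the proposed noCursorTimeout setting.

```python
from dataclasses import dataclass
from typing import List

CURSOR_IDLE_TIMEOUT_SECS = 10 * 60  # the 10-minute limit from the description


@dataclass
class DonorCursor:
    shard: str                # hypothetical donor shard name
    last_batch_secs: float    # time at which this cursor last returned a batch
    no_timeout: bool = False  # stands in for the proposed noCursorTimeout setting


def timed_out(cursors: List[DonorCursor], now_secs: float) -> List[str]:
    """Return the shards whose cursor has been idle for the full timeout."""
    return [
        c.shard
        for c in cursors
        if not c.no_timeout
        and now_secs - c.last_batch_secs >= CURSOR_IDLE_TIMEOUT_SECS
    ]


# donorA has been idle for 10 minutes; donorB returned a batch 10 seconds ago.
cursors = [DonorCursor("donorA", 0.0), DonorCursor("donorB", 590.0)]
print(timed_out(cursors, now_secs=600.0))
```

In this model only the long-idle cursor is reported, and marking it with `no_timeout=True` exempts it from the check.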
| Comments |
| Comment by Githook User [ 22/Apr/21 ] |
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'} Message: |
| Comment by Max Hirschhorn [ 07/Apr/21 ] |
|
It looks like #1 is sufficient on its own and that #2 isn't needed (referring to my earlier comment). Jason Carey pointed me to the part of LogicalSessionCacheImpl::_refresh() that bumps the lastUse for any logical session associated with an active operation context. This means that as long as a cursor for the collection cloning pipeline is active on some shard, the cursors across all shards will be kept alive. |
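The refresh behavior described in this comment can be pictured with a toy model. This is a sketch under stated assumptions, not the code in LogicalSessionCacheImpl: the `Session` class, `refresh`, `is_expired`, the shard name, and the timeout value passed in are all hypothetical, and the model collapses the per-node caches into a single list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Session:
    lsid: str                   # hypothetical logical session identifier
    last_use_secs: float        # models the session's lastUse timestamp
    # Shards that currently have an operation running under this session.
    active_shards: Set[str] = field(default_factory=set)


def refresh(sessions: List[Session], now_secs: float) -> None:
    """Bump last_use for every session that has at least one active operation."""
    for session in sessions:
        if session.active_shards:
            session.last_use_secs = now_secs


def is_expired(session: Session, now_secs: float, timeout_secs: float) -> bool:
    """A session expires once it has gone unused for the whole timeout."""
    return now_secs - session.last_use_secs >= timeout_secs


# The cloner's session has an active cursor on one shard; the other session is idle.
cloner = Session("cloner-lsid", 0.0, {"donorB"})
idle = Session("idle-lsid", 0.0)
refresh([cloner, idle], now_secs=1000.0)
print(is_expired(cloner, 1500.0, timeout_secs=1200.0))
print(is_expired(idle, 1500.0, timeout_secs=1200.0))
```

In this model a single active shard is enough to keep the whole session, and therefore every cursor attached to it, from expiring.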
| Comment by Max Hirschhorn [ 23/Mar/21 ] |
|
The following satisfies #1, although I think it'd be better for ReshardingCollectionCloner to receive the LogicalSessionId as a constructor argument. This is because implementing #2 will require a dedicated thread that runs LogicalSessionCache::vivify(lsid) while the thread in ReshardingCollectionCloner::run() is blocked waiting for a result from Pipeline::getNext(). If a new logical session ID is generated each time ReshardingCollectionCloner::_restartPipeline() is called, that other thread needs to know about the new logical session ID too so it can switch which logical session it is refreshing.
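As a rough illustration of the dedicated-thread idea for #2 (this is not the change referred to above, and not server code), the keep-alive thread and the session switch might be sketched as follows. The `SessionKeepAlive` class, its method names, the `vivify` callback, and the interval are all hypothetical.

```python
import threading
import time
from typing import Callable, List


class SessionKeepAlive:
    """Background thread that periodically calls vivify(lsid) for the current session."""

    def __init__(self, vivify: Callable[[str], None], lsid: str, interval_secs: float):
        self._vivify = vivify
        self._lsid = lsid
        self._interval = interval_secs
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def switch_session(self, lsid: str) -> None:
        # Called when the pipeline is restarted with a newly generated session ID.
        with self._lock:
            self._lsid = lsid

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop runs once per interval
        # until stop() sets the event.
        while not self._stop.wait(self._interval):
            with self._lock:
                lsid = self._lsid
            self._vivify(lsid)


def demo() -> List[str]:
    seen: List[str] = []
    keeper = SessionKeepAlive(seen.append, "lsid-1", interval_secs=0.01)
    keeper.start()
    time.sleep(0.2)                   # main thread "blocked" waiting for a batch
    keeper.switch_session("lsid-2")   # pipeline restarted with a new session
    time.sleep(0.2)
    keeper.stop()
    return seen
```

Passing one fixed LogicalSessionId to the constructor, as suggested above, would make `switch_session` unnecessary.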
| Comment by Max Hirschhorn [ 23/Mar/21 ] |
|
I hadn't realized aggregation cursors don't support the noCursorTimeout option. The changes from
| Comment by Max Hirschhorn [ 04/Mar/21 ] |
|
One idea for testing this change is to pause collection cloning while the cursors are still open and then check the "cursor.open.noTimeout" serverStatus metric reported by mongod. We can use the same trick as resharding_clones_duplicate_key.js and insert large documents to ensure there's more than one batch of documents to clone. |
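The metric check suggested here might look roughly like the sketch below. It operates on plain dictionaries standing in for serverStatus output, so it runs without a cluster. The helper name and shard names are hypothetical, and the nesting of the metric under `metrics.cursor.open.noTimeout` is an assumption about the serverStatus document layout.

```python
from typing import Dict, List


def shards_with_no_timeout_cursors(status_by_shard: Dict[str, dict]) -> List[str]:
    """Return the shards whose status reports at least one open noTimeout cursor."""
    return sorted(
        shard
        for shard, status in status_by_shard.items()
        # Assumed location of the "cursor.open.noTimeout" metric.
        if status["metrics"]["cursor"]["open"]["noTimeout"] > 0
    )


# Hand-written stand-ins for the serverStatus output of two donor shards,
# captured while collection cloning is paused.
statuses = {
    "donorA": {"metrics": {"cursor": {"open": {"noTimeout": 1}}}},
    "donorB": {"metrics": {"cursor": {"open": {"noTimeout": 0}}}},
}
print(shards_with_no_timeout_cursors(statuses))
```

A real test would gather one status document per donor shard while cloning is paused and expect every donor to appear in the result.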