[SERVER-32783] CollectionCloner::shutdown() should not block on resetting _verifyCollectionDroppedScheduler
Created: 18/Jan/18  Updated: 30/Oct/23  Resolved: 26/Jan/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.3, 3.7.2

Type: Bug
Priority: Major - P3
Reporter: Benety Goh
Assignee: Benety Goh
Resolution: Fixed
Votes: 0
Labels: initialSync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-31267 CollectionCloner fails if collection ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-01-29
Participants:
Linked BF Score: 0

 Description   

CollectionCloner::shutdown() may block on the implicit join inside RemoteCommandRetryScheduler's destructor, which runs when reset() is called on _verifyCollectionDroppedScheduler.

https://github.com/mongodb/mongo/blob/cec131017b16e12b993ef8c90733feed4081fe0d/src/mongo/db/repl/collection_cloner.cpp?#L259

collection_cloner.cpp

void CollectionCloner::_cancelRemainingWork_inlock() {
    if (_arm) {
        Client::initThreadIfNotAlready();
        _killArmHandle = _arm->kill(cc().getOperationContext());
    }
    _countScheduler.shutdown();
    _listIndexesFetcher.shutdown();
    if (_establishCollectionCursorsScheduler) {
        _establishCollectionCursorsScheduler->shutdown();
    }
    if (_verifyCollectionDroppedScheduler) {
        _verifyCollectionDroppedScheduler->shutdown();
        _verifyCollectionDroppedScheduler.reset();
    }
    _dbWorkTaskRunner.cancel();
}
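To illustrate why the reset() call above can block, here is a minimal standalone sketch. JoiningScheduler and resetWaitsForWorker are hypothetical names invented for this example, not MongoDB classes; the sketch only assumes, as the description states, that the scheduler's destructor joins outstanding work.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for a scheduler whose destructor joins its worker.
// This is an illustration only, not the real RemoteCommandRetryScheduler.
class JoiningScheduler {
public:
    explicit JoiningScheduler(std::atomic<bool>& finished)
        : _worker([&finished] {
              // Simulate a callback that is still running at shutdown time.
              std::this_thread::sleep_for(std::chrono::milliseconds(50));
              finished.store(true);
          }) {}

    ~JoiningScheduler() {
        // Implicit join: destruction cannot complete until the worker exits.
        if (_worker.joinable()) {
            _worker.join();
        }
    }

private:
    std::thread _worker;
};

// Returns true if reset() returned only after the worker had finished,
// i.e. the caller of reset() was blocked for the duration of the work.
bool resetWaitsForWorker() {
    std::atomic<bool> finished{false};
    auto scheduler = std::make_unique<JoiningScheduler>(finished);
    scheduler.reset();  // runs ~JoiningScheduler(), which joins the worker
    assert(scheduler == nullptr);
    return finished.load();
}
```

If the caller of reset() holds a mutex that the worker also needs, this wait never ends, which is the situation the ticket describes.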



 Comments   
Comment by Githook User [ 11/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 remove unnecessary scheduleWork call from CollectionCloner::_verifyCollectionDropped()

(cherry picked from commit 785f56934fcb09f121980ccf6c51d97c3af80fa2)
Branch: v3.6
https://github.com/mongodb/mongo/commit/a854e4a768da44f820f13742caba92486dc5e58d

Comment by Githook User [ 10/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 CollectionCloner::shutdown() does not wait for _verifyCollectionDropped destruction

(cherry picked from commit 73f0e0047afb1f0c0965e4c5e540decdf92c9a72)
Branch: v3.6
https://github.com/mongodb/mongo/commit/e0c0a12a2894a7bf26a111515e9f5a2e725699dd

Comment by Githook User [ 10/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 add test case for CollectionCloner handling collection drops while copying documents

(cherry picked from commit b3e32fea3fb27391fce4b170b4dcec1f25b780e4)
Branch: v3.6
https://github.com/mongodb/mongo/commit/09c9ef7ecdd7ef0cf20f2ac41cee04b94dd8ff02

Comment by Githook User [ 10/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 fix race in RemoteCommandRetryScheduler between shutdown() and resending command

(cherry picked from commit acf7bec77edde339ed6fb1bb89f7f03888144476)
Branch: v3.6
https://github.com/mongodb/mongo/commit/c317860cf6ee4a03be65ee2d79c5b467dc9dd574

Comment by Githook User [ 09/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 make RemoteCommandRetryScheduler single-use - cannot be restarted once completed

(cherry picked from commit 35b9b4287581fdc9f37d3afeebfb2c9895b2428b)
Branch: v3.6
https://github.com/mongodb/mongo/commit/3968e970161872b78fa88ad65a8785b54694c126

Comment by Githook User [ 09/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 RemoteCommandRetryScheduler releases callback resources on completion

(cherry picked from commit a046f953101dc64af42da2c72e79a11098f76a7e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/d0f5a227e98b1415b549abc55ae6f77ac5e31c48

Comment by Githook User [ 09/Feb/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-32783 reduce unnecessary lock acquisition in RemoteCommandRetryScheduler::_remoteCommandCallback()

(cherry picked from commit 8b44a736464e31e2a38e40171cb34063f180171c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/862d39eec0130c16eb984f799911323df766177f

Comment by Githook User [ 26/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 remove unnecessary scheduleWork call from CollectionCloner::_verifyCollectionDropped()
Branch: master
https://github.com/mongodb/mongo/commit/785f56934fcb09f121980ccf6c51d97c3af80fa2

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 CollectionCloner::shutdown() does not wait for _verifyCollectionDropped destruction
Branch: master
https://github.com/mongodb/mongo/commit/73f0e0047afb1f0c0965e4c5e540decdf92c9a72

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 add test case for CollectionCloner handling collection drops while copying documents
Branch: master
https://github.com/mongodb/mongo/commit/b3e32fea3fb27391fce4b170b4dcec1f25b780e4

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 fix race in RemoteCommandRetryScheduler between shutdown() and resending command
Branch: master
https://github.com/mongodb/mongo/commit/acf7bec77edde339ed6fb1bb89f7f03888144476

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 RemoteCommandRetryScheduler releases callback resources on completion
Branch: master
https://github.com/mongodb/mongo/commit/a046f953101dc64af42da2c72e79a11098f76a7e

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 make RemoteCommandRetryScheduler single-use - cannot be restarted once completed
Branch: master
https://github.com/mongodb/mongo/commit/35b9b4287581fdc9f37d3afeebfb2c9895b2428b

Comment by Githook User [ 25/Jan/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-32783 reduce unnecessary lock acquisition in RemoteCommandRetryScheduler::_remoteCommandCallback()
Branch: master
https://github.com/mongodb/mongo/commit/8b44a736464e31e2a38e40171cb34063f180171c

Comment by Matthew Russotto [ 19/Jan/18 ]

I think just removing the .reset() should be sufficient; we never re-use the CollectionCloner after cancelling remaining work, right? So the only effect of removing the reset will be that any outstanding calls to _verifyCollectionWasDropped will return immediately, which is correct behavior. I believe this will also allow the extra scheduleWork at line 870 (in 3.6) to be removed.

The deadlock is that the callback to _verifyCollectionDroppedScheduler requires the CollectionCloner lock, but _verifyCollectionDroppedScheduler.join() won't complete until all callbacks are finished running, and _cancelRemainingWork_inlock() holds the CollectionCloner lock.
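The lock ordering described above, and the suggestion to drop the reset(), can be sketched as follows. This is a simplified model under stated assumptions, not the committed fix: Worker, Cloner, _cancelled, and shutdownDoesNotDeadlock are hypothetical names, and a single std::thread stands in for the executor callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical worker: runs one callback on its own thread; destructor joins.
class Worker {
public:
    explicit Worker(std::function<void()> cb) : _thread(std::move(cb)) {}
    ~Worker() {
        if (_thread.joinable()) {
            _thread.join();
        }
    }

private:
    std::thread _thread;
};

// Hypothetical owner modelled loosely on the situation in this ticket.
class Cloner {
public:
    void start() {
        std::lock_guard<std::mutex> lk(_mutex);
        _worker = std::make_unique<Worker>([this] {
            // The callback needs the owner's lock, as in the comment above.
            std::lock_guard<std::mutex> lk(_mutex);
            if (_cancelled) {
                return;  // outstanding calls return immediately after shutdown
            }
            _workDone = true;
        });
    }

    void shutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _cancelled = true;
        // Deliberately no _worker.reset() here: the destructor's join would
        // wait for a callback that is itself waiting for _mutex.
    }

    bool isCancelled() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _cancelled;
    }

private:
    std::mutex _mutex;
    bool _cancelled = false;
    bool _workDone = false;
    // Declared last so it is destroyed (and joined) first, while _mutex
    // is still alive and not held by the destroying thread.
    std::unique_ptr<Worker> _worker;
};

// Returns true once shutdown() has completed without blocking on the worker.
bool shutdownDoesNotDeadlock() {
    Cloner cloner;
    cloner.start();
    cloner.shutdown();
    return cloner.isCancelled();  // worker is joined when cloner goes out of scope
}
```

In this model the join still happens, but only when the owner is destroyed and no longer holds its own mutex, so the callback can always acquire the lock and return.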

Comment by Spencer Brody (Inactive) [ 18/Jan/18 ]

benety.goh, do you have a sense how tricky this would be to fix?
