[SERVER-26332] remove unnecessary check that RangeDeleter has run after removeShard in remove2.js Created: 26/Sep/16  Updated: 19/Nov/16  Resolved: 05/Oct/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.3.14
Fix Version/s: 3.4.0-rc1

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Esha Maharishi (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-8836 sharding/remove2.js failing on Window... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-10-10
Participants:

 Description   

The test removes a shard and then tries to add it back. However, the migration cleanup task may not have completed yet, so attempting to re-add the shard can fail with a "local database exist in another shard" error.

The test attempts to prevent this by checking currentOp for the cleanup task:
https://github.com/mongodb/mongo/blob/r3.3.14/jstests/sharding/remove2.js#L26-L30

But the issue is that the balancer can run with

{ _waitForDelete: true }

in which case the cleanup happens inline with moveChunk, and the check above will not catch it.
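A minimal sketch of why the check can pass too early. It runs under Node.js against hand-written currentOp snapshots, not a live cluster. It assumes each in-progress entry exposes the thread name in `desc` and the running command in `command`; the helper `cleanupVisible` and the snapshot contents are hypothetical. The check looks for an op whose thread name matches `/^clean/` (see Esha's comment below), and with `_waitForDelete: true` the deletion runs inside the moveChunk op itself, so no such op ever shows up.

```javascript
// Hypothetical helper mirroring the shape of the remove2.js check:
// true if any in-progress op looks like a separate cleanup thread.
function cleanupVisible(inprog) {
  return inprog.some((op) => /^clean/.test(op.desc));
}

// Hypothetical snapshot from the older thread-per-cleanup design
// the check was written for: cleanup runs in its own thread.
const separateCleanup = [
  { desc: "conn7", command: { moveChunk: "test.foo" } },
  { desc: "cleanupOldData", command: {} },
];

// Hypothetical snapshot with _waitForDelete: true: the deletion is
// still running, but only as part of the moveChunk op.
const inlineCleanup = [
  { desc: "conn7", command: { moveChunk: "test.foo", _waitForDelete: true } },
];

console.log(cleanupVisible(separateCleanup)); // true: the test keeps waiting
console.log(cleanupVisible(inlineCleanup)); // false: the test moves on too early
```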



 Comments   
Comment by Githook User [ 05/Oct/16 ]

Author: Esha Maharishi <esha.maharishi@mongodb.com>

Message: SERVER-26332 remove unnecessary check that RangeDeleter has run after removeShard in remove2.js
Branch: master
https://github.com/mongodb/mongo/commit/f3218cdfc3d9accaa9b793c2b47ee16b19f359f3

Comment by Randolph Tan [ 30/Sep/16 ]

esha.maharishi, before the RangeDeleter, the shard used to spawn a separate thread every time it needed to perform chunk cleanup. With the RangeDeleter, asynchronous cleanup is instead queued for the RangeDeleter worker to pick up. The RangeDeleter lives for the entire lifetime of the process, until it shuts down, so once it is running it never disappears from currentOp. However, the currentOp check could be changed to verify that no migrations are running after the shard has been drained, as an alternative to creating a makeshift concurrency barrier by temporarily stopping the balancer.
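The alternative suggested here can be sketched as a bounded poll that waits until no moveChunk appears in currentOp. This Node.js sketch runs against a scripted sequence of hypothetical snapshots rather than a live server; `migrationRunning`, `waitForNoMigrations`, the `command` field, and the attempt limit are all assumptions (the field carrying the command document has varied between server versions), not the code that landed in remove2.js. Because an inline `_waitForDelete` deletion is part of the moveChunk op, this check also covers that case, and the always-present RangeDeleter entry does not block it.

```javascript
// Hypothetical predicate: is any moveChunk command still in progress?
function migrationRunning(inprog) {
  return inprog.some((op) => op.command !== undefined && op.command.moveChunk !== undefined);
}

// Poll a snapshot source until no migration is running, giving up after
// maxAttempts polls. Returns the number of polls it took.
function waitForNoMigrations(nextSnapshot, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (!migrationRunning(nextSnapshot())) {
      return attempt;
    }
  }
  throw new Error("migration still running after " + maxAttempts + " polls");
}

// Scripted snapshots: a moveChunk with inline cleanup that finishes by the
// third poll. The RangeDeleter thread is always listed but is not a migration.
const busy = { desc: "conn7", command: { moveChunk: "test.foo", _waitForDelete: true } };
const snapshots = [
  [{ desc: "RangeDeleter" }, busy],
  [{ desc: "RangeDeleter" }, busy],
  [{ desc: "RangeDeleter" }],
];
let i = 0;
const polls = waitForNoMigrations(() => snapshots[Math.min(i++, snapshots.length - 1)], 10);
console.log(polls); // 3
```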

Comment by Esha Maharishi (Inactive) [ 30/Sep/16 ]

renctan, it seems the existing currentOp check looks for a thread whose name starts with "clean" (regex: /^clean/).

At the time that check was added:

https://github.com/mongodb/mongo/commit/5860ed9463e9275cda4e50008be6d83ec849b560#diff-d8b604217d00e6a6953a0e21228fe14dR35

the range deleter thread was called "cleanupOldData":

https://github.com/mongodb/mongo/blob/5860ed9463e9275cda4e50008be6d83ec849b560/src/mongo/s/d_migrate.cpp#L739

but is now called "RangeDeleter":

https://github.com/mongodb/mongo/blob/r3.3.15/src/mongo/db/range_deleter.cpp#L415

Do you think the test is failing because it's not actually checking if the RangeDeleter is running? If we change the check to be on "RangeDeleter" instead of /^clean/, will we be correctly checking if the RangeDeleter thread is active?
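The naming question can be checked directly. This Node.js sketch uses the two thread names quoted in the links above (`cleanupOldData` and `RangeDeleter`); the helper `waitUntilAbsent` and the snapshot function are hypothetical. It shows that `/^clean/` no longer matches the current thread name, so the old check passes immediately, and that switching to "wait until the RangeDeleter entry is gone" would never succeed, because (per Randolph's reply above) that thread stays in currentOp for the life of the process.

```javascript
const oldName = "cleanupOldData"; // thread name when the check was added
const newName = "RangeDeleter";   // thread name as of r3.3.15

console.log(/^clean/.test(oldName)); // true
console.log(/^clean/.test(newName)); // false

// Hypothetical helper: poll until no op's desc matches pattern.
// Returns true if that happened within maxAttempts polls.
function waitUntilAbsent(pattern, nextSnapshot, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (!nextSnapshot().some((op) => pattern.test(op.desc))) {
      return true;
    }
  }
  return false;
}

// Hypothetical snapshot: the RangeDeleter worker is listed whether busy or idle.
const alwaysPresent = () => [{ desc: "RangeDeleter" }, { desc: "conn3" }];

console.log(waitUntilAbsent(/^clean/, alwaysPresent, 5)); // true: old check never waits
console.log(waitUntilAbsent(/^RangeDeleter$/, alwaysPresent, 5)); // false: never clears
```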

Comment by Randolph Tan [ 26/Sep/16 ]

One potential fix is to call balancerStop after removeShard, which waits for the current balancer round to finish, and then call balancerStart again. This would make remove2.js incompatible with the last-stable test suite, though.
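The barrier idea can be illustrated with a small in-memory model. This is not the shell helper mentioned above; the `MockBalancer` class, its tick-based notion of a round, and its method names are hypothetical, chosen only to show why "stop, wait for the in-flight round, then restart" guarantees that any moveChunk (including inline `_waitForDelete` cleanup) has finished before the shard is re-added.

```javascript
// Hypothetical model: a balancer round is the number of ticks left before
// its moveChunk, including any inline _waitForDelete cleanup, completes.
class MockBalancer {
  constructor() {
    this.enabled = true;
    this.roundTicksLeft = 0;
  }

  // A new round only starts while balancing is enabled.
  startRound(ticks) {
    if (this.enabled) this.roundTicksLeft = ticks;
  }

  tick() {
    if (this.roundTicksLeft > 0) this.roundTicksLeft--;
  }

  // Disable new rounds, then block until the in-flight round finishes.
  // Returns the number of ticks spent waiting.
  stopAndWait() {
    this.enabled = false;
    let waited = 0;
    while (this.roundTicksLeft > 0) {
      this.tick();
      waited++;
    }
    return waited;
  }

  start() {
    this.enabled = true;
  }
}

const balancer = new MockBalancer();
balancer.startRound(4);                // a round with inline cleanup is in flight
const waited = balancer.stopAndWait(); // barrier: returns only once it is done
const safeToReAdd = balancer.roundTicksLeft === 0;
balancer.start();                      // re-enable balancing, as proposed
console.log(waited, safeToReAdd); // 4 true
```

The trade-off noted in the comment still applies: the barrier relies on stopping the balancer, which is what makes the test incompatible with the last-stable suite.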

P.S. We should probably also remove the sleep here.

Generated at Thu Feb 08 04:11:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.