[SERVER-26250] Balancer holding distlock briefly on recover fails a subsequent split (or potentially any distlock operation) command Created: 22/Sep/16  Updated: 31/Oct/16  Resolved: 24/Oct/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.4.0-rc2

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-10-10, Sharding 2016-10-31
Participants:
Linked BF Score: 0

 Description   

The moveChunk command returns OK to the mongos, then a stepdown occurs and the migration document is left behind for the balancer. When the balancer recovers, it acquires the distlock because of the migration document, reloads the chunk metadata, discovers that the chunk has already moved, and then releases the distlock. However, the balancer's brief hold on the distlock interferes with the split command that the JS test issues right after that moveChunk command.
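
The sequence above can be reproduced in a toy model. This is only a sketch: `DistLock` and `simulate` are hypothetical names, the lock is an in-memory stand-in rather than the server's distributed lock, and the ordering of events is assumed from the description.

```python
class DistLock:
    """Toy stand-in for a per-collection distributed lock (not the server's implementation)."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, who):
        # Non-blocking acquire: fails immediately if someone else holds the lock.
        if self.holder is not None:
            return False
        self.holder = who
        return True

    def release(self, who):
        if self.holder == who:
            self.holder = None


def simulate():
    lock = DistLock()
    # moveChunk already returned OK, but the migration document was left behind.
    migration_docs = {"chunk-A"}
    events = []

    # Recovering balancer sees the leftover document and takes the distlock.
    if migration_docs:
        assert lock.try_acquire("balancer")
        events.append("balancer acquired distlock")

    # The test's split arrives while the balancer still holds the lock.
    split_ok = lock.try_acquire("split")
    events.append("split ok" if split_ok else "split failed: distlock busy")

    # Balancer reloads metadata, finds the chunk already moved, and cleans up.
    migration_docs.clear()
    lock.release("balancer")
    events.append("balancer released distlock")

    # A later attempt succeeds once the lock is free.
    retry_ok = lock.try_acquire("split")
    events.append("split retry ok" if retry_ok else "split retry failed")
    return events, split_ok, retry_ok
```

Calling `simulate()` shows the first split attempt failing while the balancer holds the lock, and a later attempt succeeding after the release.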



 Comments   
Comment by Githook User [ 24/Oct/16 ]

Author: Dianna Hohensee (DiannaHohensee) <dianna.hohensee@10gen.com>

Message: SERVER-26250 extend moveChunk command success to depend on migration document removal success
Branch: master
https://github.com/mongodb/mongo/commit/913a092022f98fc8eb5a035e515cc2cd9908443a

Comment by Dianna Hohensee (Inactive) [ 10/Oct/16 ]

Going with option 1) from the 22/Sep/16 comment: extending moveChunk command success to depend on whether the deletion of the migration document succeeded.
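
A minimal sketch of what option 1) means, not the actual commit: `MigrationStore`, `move_chunk`, and the error strings are hypothetical, and the store is an in-memory stand-in for wherever migration documents live. The only point illustrated is that the command reports OK only when the document removal also succeeded.

```python
class MigrationStore:
    """Hypothetical in-memory stand-in for the set of migration documents."""

    def __init__(self, fail_removal=False):
        self.docs = set()
        self.fail_removal = fail_removal  # simulate e.g. a stepdown interrupting cleanup

    def insert(self, chunk_id):
        self.docs.add(chunk_id)

    def remove(self, chunk_id):
        if self.fail_removal:
            return False
        self.docs.discard(chunk_id)
        return True


def move_chunk(store, chunk_id, do_migration):
    """Return an OK response only if the migration AND the document removal succeed."""
    store.insert(chunk_id)
    migrated = do_migration(chunk_id)
    removed = store.remove(chunk_id)
    if not migrated:
        return {"ok": 0, "errmsg": "migration failed"}
    if not removed:
        # The chunk may have moved, but recovery will still take the distlock,
        # so the caller must not assume follow-up operations are safe yet.
        return {"ok": 0, "errmsg": "migration document not removed"}
    return {"ok": 1}
```

With `MigrationStore(fail_removal=True)` the call returns `ok: 0` even though the migration callback succeeded, so a caller would know not to issue a split right away.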

Comment by Dianna Hohensee (Inactive) [ 22/Sep/16 ]

Options that come to mind right now:
1) don’t return OK for moveChunk if migration document is still there
2) reload chunk metadata in drain mode so distlocks can be released before accepting any new commands (like split)
3) other distlock-requiring operations must be more resilient to retrying? Or maybe only if the balancer is recovering...

None of these is very appealing... 2) is the cleanest, but it increases the time spent in drain mode, because chunk metadata must be reloaded for every collection that has active migrations.
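
For comparison, option 3) could look roughly like the following on the caller's side. This is an illustrative sketch only: `retry_on_lock_busy` and its parameters are made up, and `operation` is assumed to return False when the distlock is busy.

```python
import time


def retry_on_lock_busy(operation, attempts=5, delay_s=0.01, sleep=time.sleep):
    """Call operation() until it returns True, backing off exponentially between tries."""
    for attempt in range(attempts):
        if operation():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s * (2 ** attempt))
    return False
```

A split wrapped this way would ride out a brief balancer hold on the distlock, at the cost of every distlock-requiring operation needing the same treatment.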
