[SERVER-22658] Implement support for legacy and new migration protocol -- shard versus balancer holding the distributed lock Created: 16/Feb/16  Updated: 14/Jul/16  Resolved: 13/Jun/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.3.9

Type: Task Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Dianna Hohensee (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-22656 Componentize Migration{Source,Dest}Ma... Closed
Gantt Dependency
has to be done after SERVER-24351 Change DistLockCatalogImpl::unlock to... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 15 (06/03/16), Sharding 16 (06/24/16)
Participants:

 Description   

This task covers the changes needed in the donor shard's migration manager so that it supports both the legacy migration protocol (the shard acquires the distlock) and the new protocol, in which the CSRS primary holds the distributed lock.

Implementation (high level):

Config balancer:
The balancer will always take the distributed lock for the collection in which it is about to move a chunk before issuing the command. A protocol flag will be added to the moveChunk command. An old shard will not know to parse this flag and will simply ignore it: it will attempt to take the distributed lock as usual and fail with a LockBusy (or similar) error, which will be returned to the balancer. The balancer will interpret this response as the shard using the old protocol, release the distributed lock, and call moveChunk again; this time the shard will successfully take the distributed lock as usual.
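
For illustration, the flagged command might look roughly like the sketch below. The field name "takeDistLock" and the surrounding variables (nss, chunk, shard IDs) are assumptions, not the final wire format; the key point is that an old mongod does not parse the unknown field and proceeds with the legacy protocol.

    // Hypothetical sketch: the balancer builds moveChunk with the new flag.
    BSONObjBuilder cmdBuilder;
    cmdBuilder.append("moveChunk", nss.ns());        // target collection
    cmdBuilder.append("min", chunk.getMin());        // chunk bounds
    cmdBuilder.append("max", chunk.getMax());
    cmdBuilder.append("fromShard", fromShardId.toString());
    cmdBuilder.append("toShard", toShardId.toString());
    // New-protocol flag (name illustrative): the balancer already holds the
    // collection dist lock, so the shard must not take it. Old shards simply
    // ignore this field.
    cmdBuilder.append("takeDistLock", false);
    BSONObj moveChunkCmd = cmdBuilder.obj();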

Shard:
The shard will parse the moveChunk command for the new protocol flag. If the flag is present, the shard will not attempt to take the distributed lock for the collection; instead, it should verify that the balancer holds the distributed lock for the collection.
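
A rough sketch of the shard-side dispatch follows; the request accessor and the helper functions are hypothetical names for illustration, not the actual server API:

    Status runMoveChunk(OperationContext* txn, const MoveChunkRequest& request) {
        if (request.shouldTakeDistLock()) {
            // Legacy protocol: the shard acquires the collection dist lock
            // itself. This fails with LockBusy if the balancer already holds
            // it under the new protocol.
            auto distLock = acquireCollectionDistLock(txn, request.getNss());
            if (!distLock.isOK())
                return distLock.getStatus();
            return doMigration(txn, request);
        }
        // New protocol: the CSRS primary (balancer) holds the dist lock; the
        // shard only verifies that it is actually held before proceeding.
        Status verified = verifyBalancerHoldsDistLock(txn, request.getNss());
        if (!verified.isOK())
            return verified;
        return doMigration(txn, request);
    }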

Add tests for any new code not covered by existing tests. A multi-version test will probably be needed.



 Comments   
Comment by Githook User [ 13/Jun/16 ]

Author: Dianna Hohensee (DiannaHohensee) <dianna.hohensee@10gen.com>

Message: SERVER-22658 make moveChunk accept a flag to determine whether it should take the distlock and have the balancer take the distlock.
Branch: master
https://github.com/mongodb/mongo/commit/4ea5f4f21610e8251ad5c08dd8c1010862bbdb22

Comment by Dianna Hohensee (Inactive) [ 08/Jun/16 ]

Sounds good. I made SERVER-24470 to deal with it.

Comment by Kaloian Manassiev [ 08/Jun/16 ]

Regarding the first question: I think it would be sufficient for now to re-check that the distributed lock is still held only when the legacy moveChunk protocol is in use, but not for the new variant where the balancer holds the collection lock. To complete the checking, we could later add code in commitChunkMigration to ensure that a dist lock is held on the collection for which it is being called. Let's do that as a separate ticket, though.
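
Roughly, with illustrative names (not the actual code), the re-check before the critical section would become something like:

    // Just before entering the critical section:
    if (usingLegacyProtocol) {
        // Legacy protocol: this shard took the dist lock itself, so
        // re-verify that it still holds it.
        Status status = checkShardStillHoldsDistLock(txn, nss);
        if (!status.isOK())
            return status;
    }
    // New protocol: no shard-side re-check for now; a follow-up ticket can
    // make commitChunkMigration verify on the config server that a dist
    // lock is held for the collection.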

Comment by Kaloian Manassiev [ 08/Jun/16 ]

There is no need for now to keep a cache of the migration protocols. The solution you described, which uses the LockBusy status to guide the call, will be sufficient.

Comment by Dianna Hohensee (Inactive) [ 07/Jun/16 ]

kaloian.manassiev

Also, over here https://mongodbcr.appspot.com/74020002 you wrote this as a rough outline for how the balancer will handle the ScopedDistLock:

boost::optional<ScopedDistLock> collDistLock;
Status status = Status::OK();  // mongo::Status has no default constructor
if (shards[from].useProtocolV1) {
    // Legacy protocol: the shard takes the dist lock itself.
    status = scheduleMoveChunkWithProtocolV1(collName, from, to);
} else {
    // New protocol: the balancer holds the dist lock for the collection.
    collDistLock = distLock(collName);
    status = scheduleMoveChunkWithProtocolV2(collName, from, to);

    if (status == ErrorCodes::LockBusy) {
        // The shard must be running an older version; remember that and
        // retry with the legacy protocol on the next iteration.
        shards[from].useProtocolV1 = true;
        return;  // Returning frees the coll dist lock for the next iteration.
    }
}

Do we want to maintain an array of which migration protocol each shard uses? I'm not sure how we would otherwise go about discovering that a V1 protocol shard becomes a V2 protocol shard.

I was originally intending to just call moveChunk with the lock, and then call moveChunk again without it if it returns a LockBusy error. It would then be something like:

    Status status = Status::OK();  // mongo::Status has no default constructor
    {
        // Try the new protocol first, holding the dist lock for the call.
        ScopedDistLock collDistLock = distLock(collName);
        status = scheduleMoveChunkWithProtocolV2(collName, from, to);
    }  // The dist lock is released here, before any legacy retry.

    if (status == ErrorCodes::LockBusy) {
        // An old shard tried to take the lock itself; retry with the legacy
        // protocol, letting the shard acquire the dist lock.
        status = scheduleMoveChunkWithProtocolV1(collName, from, to);
    }

    // error checks, etc.

Comment by Dianna Hohensee (Inactive) [ 07/Jun/16 ]

kaloian.manassiev

Currently, moveChunk checks that the distlock is still held by the shard right before entering the critical section (https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/move_chunk_command.cpp#L245-L253). It won't be able to check the distlock in the same way if it doesn't hold the distlock itself.
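
Roughly, the existing check looks something like this (names approximate; see the linked lines for the real code):

    // Before entering the critical section, re-verify that the dist lock
    // taken at the start of the migration is still held by this shard.
    Status lockStatus = scopedDistLock.checkStatus();
    if (!lockStatus.isOK()) {
        // Abort: another entity may have grabbed the lock.
        return appendCommandStatus(result, lockStatus);
    }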

We talked about making the balancer use a specific lockSessionID so that if the config server primary goes down, the restarted balancer will know which locks it is holding. The lockSessionID is a mongo::OID object that is normally randomly generated. We need to iron out what to do here.
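
Something like the following, where the persistence and lock helpers are hypothetical and only mongo::OID::gen() is real API:

    // Normally every dist lock acquisition generates a fresh session ID:
    OID lockSessionID = OID::gen();

    // Instead, the balancer could create its session ID once, persist it on
    // the config servers, and reload it after a primary failover so the new
    // balancer can recognize the locks it already holds.
    OID balancerSessionID = loadOrCreateBalancerLockSessionID(txn);         // hypothetical
    auto collDistLock = distLockWithSessionID(collName, balancerSessionID); // hypothetical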
