[SERVER-53118] Make DistLock resilient to step downs on shards Created: 30/Nov/20  Updated: 29/Oct/23  Resolved: 22/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Kaloian Manassiev
Resolution: Fixed Votes: 0
Labels: PM-1965-Milestone-0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-53227 Move the DistLock to only be availabl... Closed
Problem/Incident
causes SERVER-54818 Mitigate distributed lock acquisition... Closed
causes SERVER-55574 Migration distlock acquisition fails ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25
Participants:

 Description   

Distributed locks shouldn't be used on shards because any step down will cause the lock to be held until its timeout expires, even though the operation didn't finish or was interrupted. The config server currently removes all locks on step up, but shards do not have any mechanism to re-obtain the lock and finish the operation.
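
For reference, the config server's step-up cleanup amounts to releasing every lock attributed to its own process identity. The sketch below is a minimal illustration of that idea, assuming a DistLockManager with an unlockAll method keyed by processId; the names are illustrative, not the exact server API.

// Illustrative sketch only: mirrors the config server's step-up behavior
// described above. DistLockManager/unlockAll are assumed names.
#include <string>

class DistLockManager {
public:
    // Releases every lock whose processId field matches the given id.
    void unlockAll(const std::string& processId) {
        // Placeholder: a real manager would remove the matching entries
        // from the dist lock storage (config.locks) here.
    }
};

void onStepUp(DistLockManager& mgr, const std::string& processId) {
    // Any lock still attributed to this process was orphaned by the
    // previous term; release it so new DDL operations are not blocked
    // until the lock's timeout.
    mgr.unlockAll(processId);
}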

As part of the PM-1965 project, we're defining two behaviors for the DDL operations:

  • Under the old FCV, we'll follow most of the previously implemented DDL operation code.
  • On newer versions, we'll ensure the guarantees described in the project's scope document (see the gating sketch after this list).
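
As a rough illustration of how such FCV gating could look, here is a minimal sketch; isUpgradedFCV, runLegacyDropCollection and runResilientDropCollection are hypothetical placeholders, not actual server functions.

// Hypothetical sketch of FCV-gated dispatch for a DDL operation.
bool isUpgradedFCV() { return true; /* placeholder for a real FCV check */ }
void runLegacyDropCollection() { /* previous, config-server-driven path */ }
void runResilientDropCollection() { /* new shard-driven, resilient path */ }

void runDropCollection() {
    if (!isUpgradedFCV()) {
        // Old FCV: keep the previously implemented DDL operation code.
        runLegacyDropCollection();
    } else {
        // New FCV: path with the stronger guarantees described in the
        // project's scope document.
        runResilientDropCollection();
    }
}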

With some commands we have to change the communication order: instead of going through the config server, the command will go directly to the primary shard of the database. In some cases we'll execute code that was previously implemented on the config server, for example holding a distributed lock for a resource. This task consists of providing a mechanism to re-obtain the lock after a step down occurs, or to clean up the lock after a step up, ensuring that split-brain scenarios are accounted for and will not leave the system in an inconsistent state.
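
To make the re-acquisition idea concrete, here is a minimal sketch of a recovery routine that retries the distributed lock for a bounded time after a step down; the DistLockManager shape, tryLock signature and retry policy are assumptions for illustration only.

// Hypothetical sketch: after stepping up again, the shard re-obtains the
// distributed lock it held before the step down so it can finish (or
// clean up) the interrupted DDL operation. All names are illustrative.
#include <chrono>
#include <optional>
#include <string>
#include <thread>

struct DistLock {};  // RAII handle for a held lock (placeholder).

class DistLockManager {
public:
    // Tries to acquire 'resource' for 'processId'; empty optional on failure.
    std::optional<DistLock> tryLock(const std::string& resource,
                                    const std::string& processId) {
        // Placeholder: a real manager would attempt a findAndModify
        // against config.locks here.
        return DistLock{};
    }
};

std::optional<DistLock> reacquireAfterStepDown(DistLockManager& mgr,
                                               const std::string& resource,
                                               const std::string& processId,
                                               std::chrono::seconds deadline) {
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - start < deadline) {
        // Because the processId is stable across step downs (see the
        // commits below), the manager can recognize the stale entry as
        // belonging to the same logical holder and let it be overtaken.
        if (auto lock = mgr.tryLock(resource, processId)) {
            return lock;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    return std::nullopt;  // Caller decides whether to abort the operation.
}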



 Comments   
Comment by Githook User [ 21/Jan/21 ]

Author:

{'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}

Message: SERVER-53118 Make the DistLockManager use the same processId/lockSessionId per shard
Branch: master
https://github.com/mongodb/mongo/commit/3d3df4681368ceccf891219ce848fe41d2903832

Comment by Githook User [ 13/Jan/21 ]

Author:

{'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}

Message: SERVER-53118 Make the ScopedDistLock movable between threads
Branch: master
https://github.com/mongodb/mongo/commit/e0c8a0bfdfc155443dcdae5bdb82575f4bfdb669
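
The commit above makes the RAII guard movable so that lock ownership can be handed off, for example into a task that continues the DDL operation on another thread. Below is a minimal sketch of a move-only guard in that spirit; the member layout and the commented-out unlock step are assumptions, not the actual class.

// Illustrative move-only RAII guard, in the spirit of ScopedDistLock.
#include <string>
#include <utility>

class ScopedDistLock {
public:
    explicit ScopedDistLock(std::string resource)
        : _resource(std::move(resource)) {}

    ~ScopedDistLock() {
        if (!_resource.empty()) {
            // unlock(_resource);  // release the dist lock (placeholder)
        }
    }

    // Non-copyable: two owners must never unlock the same resource.
    ScopedDistLock(const ScopedDistLock&) = delete;
    ScopedDistLock& operator=(const ScopedDistLock&) = delete;

    // Movable: the source is left empty so only the destination unlocks.
    ScopedDistLock(ScopedDistLock&& other) noexcept
        : _resource(std::exchange(other._resource, {})) {}
    ScopedDistLock& operator=(ScopedDistLock&& other) noexcept {
        if (this != &other) {
            // A real implementation would first release any lock that
            // *this currently holds before taking over the other's.
            _resource = std::exchange(other._resource, {});
        }
        return *this;
    }

private:
    std::string _resource;  // Empty means "no lock held".
};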

Comment by Githook User [ 09/Jan/21 ]

Author:

{'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}

Message: SERVER-53118 Make the DistLockManager ProcessId for each node to be the ShardId

Following the model of the config server, for stepdown purposes, this
change makes the ProcessId for the dist lock of each node even on a
shard to be the ShardId.
Branch: master
https://github.com/mongodb/mongo/commit/8b6ac29a0a5133fde5dbff8d39347ca35d187eae
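
In other words, the per-node lock identity becomes the stable shard name rather than a unique per-process value, so a new primary of the same shard is recognized as the same lock holder. A hypothetical sketch capturing the idea:

// Hypothetical sketch of the commit above: derive the dist lock ProcessId
// from the ShardId instead of a unique per-process value, so it survives
// step downs and restarts within the same shard.
#include <string>

std::string distLockProcessId(const std::string& shardId) {
    // Stable across elections: every primary of shard "shard0" presents
    // itself to the lock manager as "shard0", matching the config server
    // model where the ProcessId is also a fixed, well-known name.
    return shardId;
}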
