[SERVER-76546] _migrateClone can deadlock with prepared transactions on secondaries Created: 26/Apr/23  Updated: 29/Oct/23  Resolved: 17/May/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 5.0.19, 7.0.0-rc6, 6.0.8

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-77242 Audit call sites that take the PBWM lock Backlog
is related to SERVER-71028 MigrationChunkClonerSourceLegacy::nex... Backlog
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0, v5.0
Sprint: Sharding NYC 2023-05-29
Participants:
Linked BF Score: 105
Story Points: 5

 Description   

The _migrateClone command isn't supposed to run on secondaries, but once the command has been admitted we never recheck whether the node is still primary. If the command is still running after the node has transitioned to secondary, the following deadlock can occur:

  • The command blocks on a prepared transaction while holding the PBWM lock
  • Oplog batch application blocks because the PBWM lock is held
  • Any attempt to commit or abort the prepared transaction is never processed, because oplog application is stuck.

Note that this deadlock requires _migrateClone to start on a primary and then block on a prepared transaction after the node has stepped down to secondary.
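
The circular wait described above can be sketched as a toy wait-for graph. This is only an illustration of the reasoning, not server code: the node labels and the find_cycle helper are made up for this example, and each waiter is assumed to have exactly one blocker.

```python
# Toy wait-for graph: an edge A -> B means "A cannot progress until B does".
# The node names are illustrative labels, not MongoDB identifiers.
WAITS_FOR = {
    "_migrateClone": ["prepared transaction"],      # blocked on the prepared transaction
    "prepared transaction": ["oplog application"],  # its commit/abort arrives via the oplog
    "oplog application": ["_migrateClone"],         # needs the PBWM lock the command holds
}


def find_cycle(graph, start):
    """Return the chain of waiters from `start` back to itself, or None."""
    path = [start]
    seen = {start}
    node = start
    while graph.get(node):
        node = graph[node][0]  # each waiter has a single blocker in this toy model
        path.append(node)
        if node == start:
            return path
        if node in seen:
            return None  # a cycle exists, but it does not pass through `start`
        seen.add(node)
    return None


cycle = find_cycle(WAITS_FOR, "_migrateClone")
```

Removing any one edge breaks the cycle. For instance, if the command did not hold the PBWM lock, the "oplog application" entry would have no blocker and find_cycle would return None.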

The issue can most easily be solved in one of two ways:

  • Take the AutoGetActiveCloner, check that the node is still primary, and fail the _migrateClone command if it is not. This helper takes the AutoGetCollection, which in turn takes the RSTL, so the node is guaranteed to stay primary while the helper is in scope.
  • If for some reason we want this command to proceed while secondary, we could use AutoGetCollectionForRead or AutoGetCollectionForReadLockFree, which both skip taking the PBWM lock.
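
The first option amounts to a "check the state only after acquiring the lock that protects it" pattern. The following is a minimal sketch of that pattern, assuming a single mutex as a stand-in for the RSTL. ReplState, NotPrimaryError, and migrate_clone are hypothetical names for this illustration and do not correspond to the server's C++ classes or to the actual fix.

```python
import threading


class NotPrimaryError(Exception):
    """Hypothetical error standing in for a 'not primary' command failure."""


class ReplState:
    """Toy replication state; `lock` plays the role of the RSTL."""

    def __init__(self):
        self.lock = threading.Lock()
        self.is_primary = True

    def step_down(self):
        # A state transition must take the same lock, so it cannot
        # happen while a command holds it.
        with self.lock:
            self.is_primary = False


def migrate_clone(state, clone_batch):
    """Run `clone_batch` only if the node is primary once the lock is held."""
    with state.lock:
        if not state.is_primary:
            raise NotPrimaryError("node is no longer primary")
        return clone_batch()
```

Because the check happens while the lock is held, a step-down cannot slip in between the check and the work: either the command fails up front, or the node stays primary until the command releases the lock.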


 Comments   
Comment by Githook User [ 23/Jun/23 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-76546 _migrateClone can deadlock with prepared transactions on secondaries

(cherry picked from commit e90dcb18de438b6b6ab02b2c921463fd35b866cb)
(cherry picked from commit 623c3ea57d5222d79c2e3d68ec94485f183fe35d)
Branch: v5.0
https://github.com/mongodb/mongo/commit/e1297e894f5323d28e6236f67a7d9a9e40bdc8c6

Comment by Githook User [ 21/Jun/23 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-76546 _migrateClone can deadlock with prepared transactions on secondaries

(cherry picked from commit e90dcb18de438b6b6ab02b2c921463fd35b866cb)
Branch: v6.0
https://github.com/mongodb/mongo/commit/623c3ea57d5222d79c2e3d68ec94485f183fe35d

Comment by Githook User [ 21/Jun/23 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-76546 _migrateClone can deadlock with prepared transactions on secondaries

(cherry picked from commit e90dcb18de438b6b6ab02b2c921463fd35b866cb)
Branch: v7.0
https://github.com/mongodb/mongo/commit/f568805ae0329f0d4be6a320cde750bc15b9701d

Comment by Githook User [ 17/May/23 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-76546 _migrateClone can deadlock with prepared transactions on secondaries
Branch: master
https://github.com/mongodb/mongo/commit/e90dcb18de438b6b6ab02b2c921463fd35b866cb

Generated at Thu Feb 08 06:32:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.