[SERVER-41037] Stepup should kill all user operations (that encounter a prepare conflict) before taking the RSTL lock in X mode. Created: 07/May/19  Updated: 29/Oct/23  Resolved: 23/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: Pavithra Vetriselvan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-37574 Force reconfig should kill user opera... Closed
depends on SERVER-40700 Deadlock between read prepare conflic... Closed
Related
related to SERVER-41035 Rollback should kill all user operati... Closed
related to SERVER-41216 Rename InterruptedDueToStepDown error... Closed
related to SERVER-37988 recover locks on step up at the begin... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-05-20, Repl 2019-06-03
Participants:

 Description   

Currently, there is a three-way deadlock. Assume that we are transitioning from secondary to primary (signalDrainComplete).

  • Thread A (a read thread) acquires the RSTL lock in IX mode and is blocked by a prepared transaction due to a prepare conflict.
  • Stepup enqueues the RSTL lock in X mode and is blocked behind A.
  • The prepared transaction cannot commit until we are able to change our state to primary.
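
The cycle described above can be illustrated with a small wait-for graph. This is only a toy sketch, not server code: the names reader_A, stepup, and prepared_txn are hypothetical labels for the three participants, and find_cycle is an illustrative helper that assumes each participant waits on at most one other.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle if it leads back to `start`."""
    path = [start]
    node = waits_for.get(start)
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    return path + [node] if node == start else None


# Hypothetical labels for the three participants in the description.
deadlocked = {
    "reader_A": "prepared_txn",  # A holds RSTL in IX and waits on the prepare conflict
    "stepup": "reader_A",        # stepup waits for RSTL in X behind A
    "prepared_txn": "stepup",    # the transaction can only commit once we are primary
}

# Killing reader A removes its edge, so stepup no longer waits on a live holder.
after_kill = {k: v for k, v in deadlocked.items() if k != "reader_A"}

cycle = find_cycle(deadlocked, "reader_A")
no_cycle = find_cycle(after_kill, "stepup")
```

In this model, removing the reader's edge is what breaks the cycle, which is the motivation for killing such operations before stepup takes the RSTL lock in X mode.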

 

Alternatively, thread A might have been blocked on a prepare conflict due to a conflicting DB/collection lock. For example, the dbhash command.

EDIT: Being blocked due to conflicting MongoDB locks is not possible, as secondary prepared transactions yield their MongoDB locks.



 Comments   
Comment by Githook User [ 23/May/19 ]

Author: Pavi Vetriselvan (pvselvan) <pvselvan@umich.edu>

Message: SERVER-41037 kill all user operations that hit prepare conflicts on step up
Branch: master
https://github.com/mongodb/mongo/commit/9b8814b6e7130652eca6f8fdaa9ff215b07d2ebb

Comment by Judah Schvimer [ 17/May/19 ]

This should use the error code InterruptedDueToStepDown which SERVER-41216 will change to InterruptedDueToReplStateChange.

Comment by Suganthi Mani [ 12/May/19 ]

Quick note on the implementation: _killUserOperationsOnStepDown should be called in a loop until the RSTL lock is acquired in X mode by step up (as is done for step down). Otherwise, we might fail to mark as killed an operation that later ends up in a prepare conflict.
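
A rough sketch of why a single kill pass may not be enough, assuming a simplified toy model rather than the actual server code: ToyNode, kill_user_operations, try_acquire_x, and step_up are all hypothetical names, and the "late arrival" stands in for a reader that hits a prepare conflict after the first kill pass.

```python
class ToyNode:
    """Toy model: readers block the X lock until killed; new ones may arrive between passes."""

    def __init__(self, blocked_readers, late_arrivals):
        self.blocked = set(blocked_readers)  # ops already stuck on a prepare conflict
        self.late = list(late_arrivals)      # ops that will hit the conflict later
        self.kill_passes = 0

    def kill_user_operations(self):
        # Stand-in for one kill pass: everything currently blocked is interrupted.
        self.kill_passes += 1
        self.blocked.clear()

    def try_acquire_x(self):
        # A late reader may slip in before the X lock is granted.
        if self.late:
            self.blocked.add(self.late.pop(0))
        return not self.blocked


def step_up(node, loop=True, max_passes=10):
    """Kill once, then (optionally) keep killing until the X lock is acquired."""
    node.kill_user_operations()
    while not node.try_acquire_x():
        if not loop or node.kill_passes >= max_passes:
            return False
        node.kill_user_operations()
    return True


single_pass = step_up(ToyNode({"r1"}, ["r2"]), loop=False)
looping_node = ToyNode({"r1"}, ["r2"])
looped = step_up(looping_node, loop=True)
```

In this model the single pass misses the late reader and never gets the lock, while the looping version needs a second kill pass before the acquisition succeeds.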

Comment by Judah Schvimer [ 08/May/19 ]

Making it "depends on" makes sense!

Comment by Suganthi Mani [ 07/May/19 ]

judah.schvimer, yes, similar to SERVER-40700 (i.e., we should call the function ReplicationCoordinatorImpl::_killUserOperationsOnStepDown). Since there cannot be any writes on secondaries, _killUserOperationsOnStepDown will effectively kill the operations that are blocked by a prepare conflict.

This means we can mark this ticket as depending on SERVER-40700. What do you say?

P.S.: Rollback (SERVER-41035) can kill all user operations, as we close the connection during rollback.

Comment by Judah Schvimer [ 07/May/19 ]

suganthi.mani, can you please clarify: is this only killing user operations that actually get a prepare conflict? That is, similar to SERVER-40700 but on stepup, where there cannot be any writes because the node was a secondary?

Generated at Thu Feb 08 04:56:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.