[SERVER-38511] Implement new stepdown sequence, gated by “closeConnectionsOnStepdown”. Created: 10/Dec/18  Updated: 29/Oct/23  Resolved: 23/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.8

Type: New Feature Priority: Major - P3
Reporter: Gregory McKeon (Inactive) Assignee: Suganthi Mani
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-38515 Test that initial sync continues thro... Closed
is depended on by SERVER-38696 Add additional metrics and logging on... Closed
is depended on by SERVER-38756 Reacquiring RSTL lock during stepdown... Closed
is depended on by SERVER-38518 Add change streams testing to new ste... Closed
Problem/Incident
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-12-17, Repl 2019-01-14, Repl 2019-01-28
Participants:
Linked BF Score: 15

 Description   

Design:

1)  Start a new kill thread that loops, repeatedly killing all user operations that have taken the global lock in IX, X, or S mode.

2) Wait until RSTL is acquired.

3) Stop the kill thread.
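
The three steps above can be sketched as a toy model. This is an illustration only, not the server's implementation: the `Node` class, the operation names, and the rule that the RSTL becomes available once no IX, X, or S holder remains are all simplifying assumptions, as is the use of IS for the read operation.

```python
import threading

KILLABLE_MODES = {"IX", "X", "S"}  # global lock modes named in the design


class Node:
    """Hypothetical toy model: the RSTL is free once no killable op is active."""

    def __init__(self):
        self.cond = threading.Condition()
        self.active = {}   # operation name -> global lock mode
        self.killed = []   # names of killed operations, in kill order

    def start_op(self, name, mode):
        with self.cond:
            self.active[name] = mode

    def kill_conflicting_ops(self):
        with self.cond:
            for name, mode in list(self.active.items()):
                if mode in KILLABLE_MODES:
                    del self.active[name]   # a killed op releases its locks
                    self.killed.append(name)
            self.cond.notify_all()

    def acquire_rstl(self):
        # Blocks until no operation holds a killable mode (assumed rule).
        with self.cond:
            self.cond.wait_for(
                lambda: not any(m in KILLABLE_MODES for m in self.active.values())
            )


def step_down(node):
    stop = threading.Event()

    def kill_loop():
        # Step 1: keep killing in a loop so late arrivals are caught as well.
        while not stop.is_set():
            node.kill_conflicting_ops()
            stop.wait(0.01)

    killer = threading.Thread(target=kill_loop)
    killer.start()
    try:
        node.acquire_rstl()   # Step 2: wait until the RSTL is acquired
    finally:
        stop.set()            # Step 3: stop the kill thread
        killer.join()


node = Node()
node.start_op("write-1", "IX")
node.start_op("read-1", "IS")    # assumed: reads hold the global lock in IS
node.start_op("shared-1", "S")
step_down(node)
```

In this model the read survives the stepdown while the IX and S holders are killed. Because the killing happens on its own thread, the stepdown thread only needs an ordinary blocking acquire.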

 

Tests should include:

  • Test that reads in progress in find or getMore continue after the stepdown.
  • Test that reads not in progress, but with an open cursor, can be continued with getMore following the stepdown.


 Comments   
Comment by Githook User [ 23/Jan/19 ]

Author: Suganthi Mani (smani87) <suganthi.mani@mongodb.com>

Message: SERVER-38511 Avoid killing read operations on stepdown, gated by new server parameter “closeConnectionsOnStepdown”.
Branch: master
https://github.com/mongodb/mongo/commit/55d6072f0be597e950809d9ebcf9ba16cc96942d

Comment by Tess Avitabile (Inactive) [ 21/Dec/18 ]

schwerin, geert.bosch, matthew.russotto, we decided to proceed with the solution where we repeatedly kill operations taking IX, X, and S locks in a loop while waiting to acquire the RSTL. geert.bosch suggested that we can kill operations in a separate thread, so that we do not need the ability to enqueue locks or check whether we have acquired the lock, and we decided to go with this suggestion.

Other solutions we considered:

  • A variant on the above solution where we kill operations in the same thread. This requires the ability to check whether we have acquired a lock without waiting to acquire it.
  • Marking all operations so that they terminate if they try to acquire an IX, X, or S lock. We rejected this solution, since it adds complexity to the lock state, whereas the above solution only adds complexity to the stepdown path.
  • Requiring operations to decide at the beginning if they will ever take an IX, X, or S lock. We rejected this solution, since it would restrict future operations we could build in our system.
Comment by Tess Avitabile (Inactive) [ 20/Dec/18 ]

I think it is fine to continue implementation on this ticket according to the design for now. That is, we will have a single call to killAllWriteOperations() that kills operations holding IX and X locks, and we will stop closing connections on stepdown. This will allow us to start writing tests that we don't kill readers. Since this behavior will be guarded by the closeConnectionsOnStepdown flag, it is fine that it would cause deadlock with prepared transactions for now. Once we decide on the design, we can file another ticket to handle S locks and handle operations that attempt to acquire IX, X, and S locks after we have already killed operations.
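
A minimal sketch of the gap this interim behavior leaves, under stated assumptions: the Python function below is a hypothetical stand-in that borrows the name of killAllWriteOperations(), and the operation names and the dictionary of active operations are invented for illustration.

```python
WRITE_MODES = {"IX", "X"}          # modes killed by the single interim pass
BLOCKING_MODES = {"IX", "X", "S"}  # modes the full design must handle


def kill_all_write_operations(active_ops):
    """Hypothetical single pass: kill ops holding IX or X, return their names."""
    killed = [name for name, mode in active_ops.items() if mode in WRITE_MODES]
    for name in killed:
        del active_ops[name]
    return killed


ops = {"write-1": "IX", "read-1": "IS", "shared-1": "S"}
killed = kill_all_write_operations(ops)

# An operation that takes IX after the single pass is not caught.
ops["write-2"] = "IX"

remaining_blockers = sorted(
    name for name, mode in ops.items() if mode in BLOCKING_MODES
)
```

Here the S holder and the late IX arrival both remain, which corresponds to the two cases the comment defers to a follow-up ticket.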
