[SERVER-79191] continuous_initial_sync.py Can Be in Rollback During FSM Teardown Created: 21/Jul/23  Updated: 29/Jan/24  Resolved: 19/Jan/24

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 7.3.0-rc0, 7.0.6

Type: Bug Priority: Major - P3
Reporter: Brett Nawrocki Assignee: Sean Zimmerman
Resolution: Fixed Votes: 0
Labels: repl-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.2, v7.1, v7.0, v6.0
Sprint: Repl 2023-08-07, Repl 2023-08-21, Repl 2023-09-04, Repl 2023-09-18, Repl 2023-10-02, Repl 2023-10-16, Repl 2023-10-30, Repl 2023-11-13, Repl 2023-11-27, Repl 2023-12-11, Repl 2023-12-25, Repl 2024-01-08, Repl 2024-01-22
Participants:
Linked BF Score: 8

 Description   

It's possible for the ContinuousInitialSync hook to leave the cluster in rollback during FSM teardown (see this comment for more details), which can cause the error seen in BF-28126.

The hook should ensure that the topology is stable (e.g. by performing a write and confirming it replicates successfully to all nodes) before FSM teardown begins.
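
A minimal sketch of one possible stability check, assuming the hook can obtain a replSetGetStatus-shaped document. The helper names is_topology_stable and wait_for_stable_topology and the get_status callable are hypothetical and are not taken from the actual fix. The sketch polls member states instead of performing the replicated write suggested above; a write acknowledged by every node would be a stronger check.

```python
import time

# Member states treated as settled. "stateStr" is the per-member field
# reported by the replSetGetStatus command.
SETTLED_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def is_topology_stable(status):
    """Return True if there is exactly one primary and no member is in
    ROLLBACK, STARTUP2, RECOVERING, or any other transitional state."""
    states = [member["stateStr"] for member in status.get("members", [])]
    if states.count("PRIMARY") != 1:
        return False
    return all(state in SETTLED_STATES for state in states)


def wait_for_stable_topology(get_status, timeout_secs=60.0, interval_secs=0.5,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the topology is stable or the timeout expires.

    get_status is a hypothetical callable that returns a
    replSetGetStatus-shaped dict. Raises TimeoutError if the replica set
    never settles within timeout_secs.
    """
    deadline = clock() + timeout_secs
    while True:
        if is_topology_stable(get_status()):
            return
        if clock() >= deadline:
            raise TimeoutError("replica set did not stabilize before teardown")
        sleep(interval_secs)
```

With pymongo, get_status could plausibly be something like `lambda: client.admin.command("replSetGetStatus")`, called just before the hook hands control back for teardown.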



 Comments   
Comment by Githook User [ 29/Jan/24 ]

Author:

{'name': 'seanzimm', 'email': '102551488+seanzimm@users.noreply.github.com', 'username': 'seanzimm'}

Message: SERVER-79191: Make sure topology is stable during ContinuousInitialSyncHook

GitOrigin-RevId: 290fbfb8c30649de9b12509ac0ca22ade1cf0f15
Branch: v7.0
https://github.com/mongodb/mongo/commit/027db00f7684051d08c5b5d23d5d1d632e309871

Comment by Githook User [ 19/Jan/24 ]

Author:

{'name': 'seanzimm', 'email': '102551488+seanzimm@users.noreply.github.com', 'username': 'seanzimm'}

Message: SERVER-79191: Make sure topology is stable during ContinuousInitialSyncHook (#18033)

GitOrigin-RevId: 0b8363faeb014d3dc3dea5089f2db5fcfee9d6ba
Branch: master
https://github.com/mongodb/mongo/commit/d08038252cf0d1346a78399bb82d4f33b78345f2

Comment by Sean Zimmerman [ 28/Jul/23 ]

Initial thoughts from looking at it this morning. The ContinuousInitialSync hook has a very similar structure to [ContinuousStepdown|https://github.com/10gen/mongo/blob/master/buildscripts/resmokelib/testing/hooks/stepdown.py]. I don't see any concurrency mechanisms that stepdown uses and continuous initial sync lacks. The thread is paused between tests, and killed and restarted between suites.

At a glance, the methods for stepping up the new primary seem similar too: ContinuousInitialSync calls replSetStepUp directly, while stepdown uses the fixture's stepup_node, but the code in ContinuousInitialSync appears to be identical to that function. I will keep looking into it, but right now I don't see any specific reason why ContinuousInitialSync would have this problem when ContinuousStepdown does not.

Comment by Max Hirschhorn [ 21/Jul/23 ]

I wonder if continuous_initial_sync.py is missing some synchronization to quiesce the MongoDB cluster and ensure the topology is stable. The concurrency framework is written expecting the topology to be stable whenever the $config.setup() and $config.teardown() functions are called. It only expects elections, shutdowns, rollbacks, etc. during the execution of the $config.states functions. I don't think the problem here is specific to the multi_statement_transaction_simple.js FSM workload.

This might be something better looked into by the Replication team.
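
The kind of synchronization described above could be sketched roughly as follows. This is an illustration only, not the code in continuous_initial_sync.py: the class DisruptionThread and the disrupt and quiesce callables are hypothetical. The idea is that pause() returns only after any in-flight disruption has finished and the cluster has been quiesced, so that setup() and teardown() run against a stable topology.

```python
import threading


class DisruptionThread(threading.Thread):
    """Hypothetical background thread that perturbs a cluster until paused."""

    def __init__(self, disrupt, quiesce):
        super().__init__(daemon=True)
        self._disrupt = disrupt    # runs one round of disruption
        self._quiesce = quiesce    # blocks until the topology is stable
        self._lock = threading.Lock()
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            # The lock is held for a whole round, so pause() cannot
            # return while a disruption is still in flight.
            with self._lock:
                if self._stopped.is_set():
                    break
                self._disrupt()
            # Brief gap so a waiting pause() can take the lock.
            self._stopped.wait(0.01)

    def pause(self):
        """Wait for the current round to end, then quiesce the cluster.

        The lock stays held until resume() is called, which keeps the
        thread from starting another round in the meantime.
        """
        self._lock.acquire()
        try:
            self._quiesce()
        except BaseException:
            self._lock.release()
            raise

    def resume(self):
        """Allow disruption rounds to start again."""
        self._lock.release()

    def stop(self):
        """Ask the thread to exit after its current round."""
        self._stopped.set()
```

A test runner would call pause() before $config.teardown() and resume() once the next test's states are executing.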

Generated at Thu Feb 08 06:40:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.