[SERVER-79191] continuous_initial_sync.py Can Be in Rollback During FSM Teardown Created: 21/Jul/23 Updated: 29/Jan/24 Resolved: 19/Jan/24 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 7.3.0-rc0, 7.0.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Brett Nawrocki | Assignee: | Sean Zimmerman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | repl-shortlist | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Replication
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v7.2, v7.1, v7.0, v6.0
|
||||||||
| Sprint: | Repl 2023-08-07, Repl 2023-08-21, Repl 2023-09-04, Repl 2023-09-18, Repl 2023-10-02, Repl 2023-10-16, Repl 2023-10-30, Repl 2023-11-13, Repl 2023-11-27, Repl 2023-12-11, Repl 2023-12-25, Repl 2024-01-08, Repl 2024-01-22 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 8 | ||||||||
| Description |
|
It's possible for the ContinuousInitialSync hook to leave the cluster in rollback during FSM teardown (see this comment for more details), which can cause the error seen in BF-28126. The hook should ensure that the topology is stable (e.g. by performing a write and confirming that it has replicated successfully to all nodes) before FSM teardown begins. |
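The "perform a write and confirm it reaches all nodes" idea can be sketched roughly as below. This is only an illustration, not the fix that was committed: `run_command` is a hypothetical callable that sends a command document to the primary and returns the raw reply (something in the spirit of a driver's generic command helper), and the function names, collection name, and default timeout are assumptions. It also assumes every node is data-bearing, so that `w` equal to the node count means "all nodes".

```python
from typing import Any, Callable, Dict


def build_barrier_insert(collection: str, num_nodes: int, wtimeout_ms: int) -> Dict[str, Any]:
    """Build an insert command that every node must acknowledge."""
    return {
        "insert": collection,
        "documents": [{"purpose": "pre-teardown barrier"}],
        # w=<number of nodes> asks for acknowledgement from all data-bearing
        # members rather than just a majority; wtimeout bounds the wait.
        "writeConcern": {"w": num_nodes, "wtimeout": wtimeout_ms},
    }


def write_replicated_everywhere(
    run_command: Callable[[Dict[str, Any]], Dict[str, Any]],  # hypothetical helper
    collection: str,
    num_nodes: int,
    wtimeout_ms: int = 60000,
) -> bool:
    """Return True only if the barrier write was acknowledged by all nodes."""
    reply = run_command(build_barrier_insert(collection, num_nodes, wtimeout_ms))
    # A reply can report ok: 1 and still carry a writeConcernError when the
    # write was applied on the primary but not replicated in time.
    return reply.get("ok") == 1 and "writeConcernError" not in reply
```

A hook could call something like this after its last topology change and proceed to teardown only when it returns True; a node that is still rolling back cannot acknowledge the write, so the call would time out instead.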
| Comments |
| Comment by Githook User [ 29/Jan/24 ] |
|
Author: {'name': 'seanzimm', 'email': '102551488+seanzimm@users.noreply.github.com', 'username': 'seanzimm'} Message: GitOrigin-RevId: 290fbfb8c30649de9b12509ac0ca22ade1cf0f15 |
| Comment by Githook User [ 19/Jan/24 ] |
|
Author: {'name': 'seanzimm', 'email': '102551488+seanzimm@users.noreply.github.com', 'username': 'seanzimm'} Message: GitOrigin-RevId: 0b8363faeb014d3dc3dea5089f2db5fcfee9d6ba |
| Comment by Sean Zimmerman [ 28/Jul/23 ] |
|
Initial thoughts from looking at it this morning: the ContinuousInitialSync hook has a very similar structure to [ContinuousStepdown|https://github.com/10gen/mongo/blob/master/buildscripts/resmokelib/testing/hooks/stepdown.py]. I don't see any concurrency mechanism that stepdown uses and continuous initial sync lacks. The thread is paused between tests and killed and restarted between suites.
At a glance, the methods for stepping up the new primary look similar too: ContinuousInitialSync calls replSetStepUp directly while stepdown uses the fixture's stepup_node, but the code in ContinuousInitialSync appears to be exactly the same as that function. I will keep looking into it, but right now I don't see any specific reason why ContinuousInitialSync would have this problem when ContinuousStepdown does not. |
| Comment by Max Hirschhorn [ 21/Jul/23 ] |
|
I wonder if continuous_initial_sync.py is missing some synchronization to quiesce the MongoDB cluster and ensure the topology is stable. The concurrency framework is written on the assumption that the topology is stable whenever the $config.setup() and $config.teardown() functions are called. It expects elections, shutdowns, rollbacks, etc. only during the execution of the $config.states functions. I don't think the problem here is specific to the multi_statement_transaction_simple.js FSM workload. This might be better looked into by the Replication team. |
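One way to picture the missing quiesce step is a poll over `replSetGetStatus` replies that returns only once no member is in ROLLBACK (member state 9) or any other transitional state. The sketch below is an assumption-laden illustration rather than resmoke's actual code: `get_status` is a hypothetical callable returning a `replSetGetStatus`-style document, the function names are invented, and treating "exactly one PRIMARY, everything else SECONDARY" as the definition of stable is a simplification that a suite with a node still in initial sync would need to adjust.

```python
import time
from typing import Any, Callable, Dict

# Replica set member state codes as reported in replSetGetStatus "members".
PRIMARY, SECONDARY = 1, 2


def topology_is_stable(status: Dict[str, Any]) -> bool:
    """True when there is exactly one primary and all other members are secondaries."""
    states = [member["state"] for member in status.get("members", [])]
    return states.count(PRIMARY) == 1 and all(s in (PRIMARY, SECONDARY) for s in states)


def wait_for_stable_topology(
    get_status: Callable[[], Dict[str, Any]],  # hypothetical status fetcher
    timeout_secs: float = 60.0,
    poll_secs: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the topology is stable; return False if the deadline passes."""
    deadline = clock() + timeout_secs
    while True:
        if topology_is_stable(get_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise without a live cluster; a hook would run a check like this before handing control back for $config.teardown().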