[SERVER-37390] RollbackTestFixture doesn't need to wait for a new primary if it didn't shut down the current primary Created: 28/Sep/18  Updated: 29/Oct/23  Resolved: 03/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.3, 4.3.3, 4.0.20

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Suganthi Mani
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-44679 Rollback fuzzer must account for tran... Closed
Problem/Incident
causes SERVER-45400 Reduce OplogFetcher getMore awaitData... Closed
Related
is related to SERVER-44679 Rollback fuzzer must account for tran... Closed
is related to SERVER-45501 RollbackTest shouldn't run serverStat... Closed
is related to SERVER-42650 Remove stale comments mentioned in th... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0
Sprint: Repl 2018-10-22, Repl 2018-11-05, Repl 2019-09-23, Repl 2019-11-18, Execution Team 2019-12-16, Repl 2019-12-02, Repl 2019-12-16, Execution Team 2020-01-13, Execution Team 2019-12-30
Participants:
Linked BF Score: 33

 Description   

The rollback fuzzer suites utilize the RollbackTestFixture to run randomized rollback testing. Some of the suites (e.g. rollback_fuzzer_clean_shutdowns) will shut down nodes at random times. After we shutdown a node, we wait until there is a stable primary so that we can continue doing writes in the test. The way we check this can be racy, though, if we get a current primary that then steps down, and we get that same node as a secondary. Instead of asserting that the primary and secondary are not equal, I think it would be simpler to just check if we shut down the original primary. If so, then we need to wait for a new primary. Otherwise, we shouldn't need to wait, since the original primary should be stable, since it is supported by an arbiter.



 Comments   
Comment by Githook User [ 03/Aug/20 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-37390 Run rollback test fixture with high election timeout to avoid any unplanned election.

(cherry picked from commit 02ce213b40c56096c9c57e093778b0889c335bb9)
(cherry picked from commit 5b85b8787d6e8cfd4234b09304f3538506f70bd9)
(cherry picked from commit 9b470eb73873f5db5c9fcee5df5316d477a1fa12)
(cherry picked from commit 04a2c9acc7ca061fb86736b377b897b11f6c7c48)
(cherry picked from commit f2aa1ffe05804aa3cc21ad5f980bca998dde09f3)
(cherry picked from commit 9aa9fa4a2844f0fe7890e01d621960a0c64607f6)
Branch: v4.0
https://github.com/mongodb/mongo/commit/3d9bdde82477bdda180993c435f29d41ca05e52f

Comment by Githook User [ 15/Jan/20 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-37390 Run rollback test fixture with high election timeout to
avoid any unplanned election.

(cherry picked from commit 02ce213b40c56096c9c57e093778b0889c335bb9)
(cherry picked from commit 5b85b8787d6e8cfd4234b09304f3538506f70bd9)
(cherry picked from commit 9b470eb73873f5db5c9fcee5df5316d477a1fa12)
Branch: v4.2
https://github.com/mongodb/mongo/commit/9b17593e40df04bc1e001bac2cf9cdda62570011

Comment by William Schultz (Inactive) [ 07/Jan/20 ]

It looks like we already have a failpoint that is used to speed up tests that are slowed down by that awaitData timeout.

Comment by William Schultz (Inactive) [ 06/Jan/20 ]

judah.schvimer A previous investigation turned up this remaining timeout that depends on the election timeout. I did not do a thorough look at it but that was my hypothesis for why rollback tests ran slower when using a 24 hour election timeout. Here are the runtimes of some rollback tests before and after the changes made by suganthi.mani.

Before:

change_stream_speculative_majority_rollback.js ran in 22.75 seconds:
initial_sync_fails_on_rollback.js ran in 22.85 seconds:
prepare_failover_rollback_commit.js ran in 27.97 seconds:
read_committed_after_rollback.js ran in 17.05 seconds:
read_operations_during_rollback.js ran in 27.93 seconds:
rollback_aborted_prepared_transaction.js ran in 22.71 seconds:
rollback_after_disabling_majority_reads.js ran in 42.04 seconds:
rollback_after_enabling_majority_reads.js ran in 73.88 seconds:
rollback_all_op_types.js ran in 38.89 seconds:
rollback_auth.js ran in 47.11 seconds:
rollback_capped_deletions.js ran in 19.17 seconds:
rollback_collmods.js ran in 12.03 seconds:
rollback_creates_rollback_directory.js ran in 77.88 seconds:
rollback_crud_op_sequences.js ran in 37.10 seconds:
rollback_ddl_op_sequences.js ran in 37.37 seconds:
rollback_drop_database.js ran in 19.30 seconds:
rollback_drop_index_after_rename.js ran in 24.96 seconds:
rollback_dup_ids.js ran in 24.09 seconds:
rollback_files_no_prepare_conflict.js ran in 19.10 seconds:
rollback_index_build_and_create.js ran in 18.92 seconds:
rollback_index_build_start.js ran in 19.10 seconds:
rollback_index_build_start_abort.js ran in 61.67 seconds:
rollback_index_build_start_abort_not_create.js ran in 19.06 seconds:
rollback_index_build_start_commit.js ran in 63.19 seconds:
rollback_index_build_start_commit_drop.js ran in 102.15 seconds:
rollback_index_build_start_not_create.js ran in 27.07 seconds:
rollback_prepare_transaction.js ran in 19.01 seconds:
rollback_reconstructs_transactions_prepared_before_stable.js ran in 23.91 seconds:
rollback_recovery_commit_transaction_before_stable_timestamp.js ran in 21.72 seconds:
rollback_remote_cursor_retry.js ran in 20.26 seconds:
rollback_rename_collection_on_sync_source.js ran in 19.91 seconds:
rollback_rename_count.js ran in 19.66 seconds:
rollback_time_limit_param.js ran in 21.48 seconds:
rollback_transaction_table.js ran in 13.96 seconds:
rollback_transactions_count.js ran in 19.18 seconds:
rollback_unprepared_transactions.js ran in 22.02 seconds:
rollback_via_refetch_commit_transaction.js ran in 19.94 seconds:
rollback_via_refetch_survives_nonexistent_collection_drop.js ran in 23.90 seconds:
rollback_views.js ran in 38.86 seconds:
rollback_waits_for_bgindex_completion.js ran in 20.24 seconds:
rollback_with_socket_error_then_steady_state.js ran in 37.12 seconds:
transactions_after_rollback_via_refetch.js ran in 20.55 seconds:
unrecoverable_rollback_early_exit.js ran in 453.88 seconds:
validate_fails_during_rollback.js ran in 21.65 seconds:

After:

change_stream_speculative_majority_rollback.js ran in 39.94 seconds:
initial_sync_fails_on_rollback.js ran in 72.80 seconds:
prepare_failover_rollback_commit.js ran in 107.80 seconds:
read_committed_after_rollback.js ran in 18.80 seconds:
read_operations_during_rollback.js ran in 70.29 seconds:
rollback_aborted_prepared_transaction.js ran in 70.68 seconds:
rollback_after_disabling_majority_reads.js ran in 82.30 seconds:
rollback_after_enabling_majority_reads.js ran in 164.67 seconds:
rollback_all_op_types.js ran in 23.36 seconds:
rollback_auth.js ran in 42.78 seconds:
rollback_capped_deletions.js ran in 69.18 seconds:
rollback_collmods.js ran in 11.89 seconds:
rollback_creates_rollback_directory.js ran in 65.06 seconds:
rollback_crud_op_sequences.js ran in 37.12 seconds:
rollback_ddl_op_sequences.js ran in 52.52 seconds:
rollback_drop_database.js ran in 69.33 seconds:
rollback_drop_index_after_rename.js ran in 69.33 seconds:
rollback_dup_ids.js ran in 77.32 seconds:
rollback_files_no_prepare_conflict.js ran in 69.06 seconds:
rollback_index_build_and_create.js ran in 68.95 seconds:
rollback_index_build_start.js ran in 72.97 seconds:
rollback_index_build_start_abort.js ran in 234.38 seconds:
rollback_index_build_start_abort_not_create.js ran in 70.32 seconds:
rollback_index_build_start_commit.js ran in 242.54 seconds:
rollback_index_build_start_commit_drop.js ran in 369.61 seconds:
rollback_index_build_start_not_create.js ran in 69.11 seconds:
rollback_prepare_transaction.js ran in 68.96 seconds:
rollback_reconstructs_transactions_prepared_before_stable.js ran in 44.10 seconds:
rollback_recovery_commit_transaction_before_stable_timestamp.js ran in 70.92 seconds:
rollback_remote_cursor_retry.js ran in 71.24 seconds:
rollback_rename_collection_on_sync_source.js ran in 39.30 seconds:
rollback_rename_count.js ran in 69.41 seconds:
rollback_time_limit_param.js ran in 2.44 seconds:
rollback_transaction_table.js ran in 17.00 seconds:
rollback_transactions_count.js ran in 69.17 seconds:
rollback_unprepared_transactions.js ran in 70.03 seconds:
rollback_via_refetch_commit_transaction.js ran in 45.02 seconds:
rollback_via_refetch_survives_nonexistent_collection_drop.js ran in 74.98 seconds:
rollback_views.js ran in 38.13 seconds:
rollback_waits_for_bgindex_completion.js ran in 69.40 seconds:
rollback_with_socket_error_then_steady_state.js ran in 37.25 seconds:
transactions_after_rollback_via_refetch.js ran in 70.54 seconds:
unrecoverable_rollback_early_exit.js ran in 449.89 seconds:
validate_fails_during_rollback.js ran in 69.07 seconds:

Those are all tests with "rollback" in their name, so it may include some that don't use RollbackTest fixture. Here is a general sense of the timing distribution before and after the changes, limited to tests between 0 and 100 seconds in duration.

Before (mean=40.05 seconds)

    0.00 |0     |
   10.00 |14    | ************************
   20.00 |17    | ******************************
   30.00 |5     | ********
   40.00 |2     | ***
   50.00 |0     |
   60.00 |2     | ***
   70.00 |2     | ***
   80.00 |0     |
   90.00 |0     |

After (mean=84.52 seconds)

    0.00 |1     | **
   10.00 |3     | *******
   20.00 |1     | **
   30.00 |5     | ************
   40.00 |3     | *******
   50.00 |1     | **
   60.00 |12    | ******************************
   70.00 |11    | ***************************
   80.00 |1     | **
   90.00 |0     |

Our thought was to introduce a failpoint to reduce that timeout for the RollbackTest fixture.

Comment by Judah Schvimer [ 06/Jan/20 ]

william.schultz, how does the high election timeout slow down the test fixture? The timeout is 24 hours, and the test doesn't take 24 hours, so I would expect us to not rely on that timeout ever.

Comment by William Schultz (Inactive) [ 06/Jan/20 ]

suganthi.mani Can we file a ticket for improving the speed of RollbackTest fixture (see comment above) now that these changes are pushed? Per our latest in person discussion, my understanding was that the high election timeout slows down the test fixture somewhat significantly.

Comment by Githook User [ 03/Jan/20 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-37390 Run rollback test fixture with high election timeout to
avoid any unplanned election.
Branch: master
https://github.com/mongodb/mongo/commit/02ce213b40c56096c9c57e093778b0889c335bb9

Comment by William Schultz (Inactive) [ 15/Nov/19 ]

suganthi.mani In relation to our discussion about using a high election timeout in RollbackTest to prevent any unwanted topology changes, I noticed when running a sample rollback test that this timeout can cause a test to run much slower if we use an extremely high election timeout i.e. 24 hours. It added around 50 seconds of execution time to the "jstests/replsets/rollback_rename_collection_on_sync_source.js" test. That awaitData timeout may be something we could make configurable for tests that use a high election timeout.

Comment by Suganthi Mani [ 14/Nov/19 ]

Note to myself-> This needs to backported to 4.2 + 4.0.

Comment by William Schultz (Inactive) [ 03/Sep/19 ]

As of now, I don't think this will be included in the Faster Local Testing (FLT) project. This ticket was originally filed to address a correctness concern mentioned in the linked BF. It was not about improving performance of the test fixture.

Comment by Ratika Gandhi [ 29/Aug/19 ]

william.schultz, Adding you here to please check if this is in scope for Faster Local Testing. 

Generated at Thu Feb 08 04:45:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.