[SERVER-44068] Investigate the slowest sections of RollbackTest and remove any wasted downtime Created: 17/Oct/19 Updated: 29/Oct/23 Resolved: 27/Apr/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | William Schultz (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | rollbacks.txt |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-04-20, Repl 2020-05-04 |
| Description |
We should investigate the slowest parts of the RollbackTest fixture to identify and remove any wasted downtime. Wasted downtime is any time spent not doing useful, required work: for example, waiting for data to propagate over the network, or waiting for an arbitrary timeout (e.g. a heartbeat interval or election timeout) to fire in the server. In tests that use the rollback test fixture, the rollback itself often takes only a few hundred milliseconds, so in theory the fixture should be able to run a full rollback test sequence in not much more time than the rollback itself, setting aside required setup and teardown time. We should investigate whether this is feasible with our current testing infrastructure.
| Comments |
| Comment by William Schultz (Inactive) [ 27/Apr/20 ] |

I think we may want to backport the 'restartHeartbeats' command (i.e. this commit) to 4.4 so that we don't run into multiversion failures when we re-enable the multiversion suites, which are currently disabled (see
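As a hedged sketch of what using such a command might look like from a jstest, assuming it is exposed through the test-only replSetTest admin command (the option name and the rollbackTest fixture variable follow the surrounding discussion and may differ from the actual commit):

{code:javascript}
// Sketch: ask every node in the fixture to send a fresh round of heartbeats
// immediately, instead of waiting for the next heartbeat interval.
// Assumes a test-only {replSetTest: 1, restartHeartbeats: true} command,
// as described in this comment.
for (const node of rollbackTest.getTestFixture().nodes) {
    assert.commandWorked(
        node.adminCommand({replSetTest: 1, restartHeartbeats: true}));
}
{code}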
| Comment by William Schultz (Inactive) [ 27/Apr/20 ] |

Now that the final changes have been pushed, we can see the overall improvements from this ticket by measuring the durations of transitions inside the rollback_test_control.js test on my local workstation, with --opt=on --dbg=on build flags:

Before

After

This represents an approximate reduction of 700ms for the kSyncSourceOpsBeforeRollback transition and 1973ms for the kSteadyStateOps transition. To understand these numbers a bit more accurately, we can also look at the before/after distributions of durations for the transitions we improved. The numbers below are taken from 20 repeated runs of rollback_test_control.js on my local workstation, before and after the changes.

kSyncSourceOpsBeforeRollback, Before

kSyncSourceOpsBeforeRollback, After

kSteadyStateOps, Before

kSteadyStateOps, After
| Comment by Githook User [ 27/Apr/20 ] |

Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message:
| Comment by Githook User [ 14/Apr/20 ] |

Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message:
| Comment by Githook User [ 14/Apr/20 ] |

Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message:
| Comment by Githook User [ 13/Apr/20 ] |

Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message:
| Comment by Githook User [ 13/Apr/20 ] |

Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message:
| Comment by William Schultz (Inactive) [ 10/Apr/20 ] |

This patch's goal is to remove any remaining speed bottlenecks in RollbackTest. I added log messages to measure the duration of each RollbackTest transition method, and added a control test that runs a "no-op" rollback sequence so that we can measure the overhead of the fixture accurately. My goal is to measure and optimize the time spent inside the transition methods, since that is the only part of a test that the RollbackTest fixture can control; if expensive or slow operations run in between transitions, that makes the test slow, but it is not the fault of RollbackTest. The speed improvements focus on the slowest sections of RollbackTest, measured in a baseline run of rollback_test_control.js before any improvements.
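As a rough illustration, here is a minimal sketch of the kind of no-op control sequence and per-transition timing described above, assuming the RollbackTest fixture API from jstests/replsets/libs/rollback_test.js (the timed() helper is hypothetical; the real patch logs durations inside the fixture itself):

{code:javascript}
// Sketch of a "no-op" rollback control test: drive RollbackTest through a
// full rollback sequence with no user writes, timing each transition.
load("jstests/replsets/libs/rollback_test.js");  // defines RollbackTest

const rollbackTest = new RollbackTest("rollback_test_control");

// Hypothetical timing helper for illustration only.
function timed(name, fn) {
    const start = Date.now();
    fn();
    jsTestLog("Transition '" + name + "' took " + (Date.now() - start) + " ms");
}

timed("kRollbackOps", () => rollbackTest.transitionToRollbackOperations());
timed("kSyncSourceOpsBeforeRollback",
      () => rollbackTest.transitionToSyncSourceOperationsBeforeRollback());
timed("kSyncSourceOpsDuringRollback",
      () => rollbackTest.transitionToSyncSourceOperationsDuringRollback());
timed("kSteadyStateOps",
      () => rollbackTest.transitionToSteadyStateOperations());

rollbackTest.stop();
{code}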
The 'kSyncSourceOpsBeforeRollback' transition and the 'kSteadyStateOps' transition are the slowest. We can look at each of them separately to understand why.

kSyncSourceOpsBeforeRollback

The 'kSyncSourceOpsBeforeRollback' transition is made slow primarily by the election that switches the primary from the rollback node to the sync source node. The sync source node repeatedly fails to run for election, even in response to an explicit stepUp request, because it cannot see a majority. This continues until the sync source node (20281) finally receives a heartbeat from the tiebreaker node (20282), which alerts it to a healthy majority; the next stepUp attempt then succeeds and the node gets elected.
The test fixture does not learn of the newly elected primary, however, until roughly another 200ms after the transition to primary completes, which appears to be due to the default retry interval of the getPrimary call. To resolve this slowness, we want to immediately trigger a round of heartbeats after reconnecting the sync source node, and also ensure that the test fixture promptly elects a primary and learns of it by reducing the retry interval of the replSetStepUp attempts and of the getPrimary call, as in the sketch below.
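A hedged sketch of the shape of that fix (the node handles, the 50ms interval, and the heartbeat-restart command are illustrative assumptions, not the exact patch):

{code:javascript}
// After reconnecting the sync source, nudge heartbeats so it learns of a
// healthy majority right away (assumes the test-only restartHeartbeats
// command discussed earlier).
assert.commandWorked(
    syncSource.adminCommand({replSetTest: 1, restartHeartbeats: true}));

// Retry stepUp on a short interval rather than relying on the default one.
assert.soonNoExcept(() => {
    assert.commandWorked(syncSource.adminCommand({replSetStepUp: 1}));
    return true;
}, "sync source failed to step up", ReplSetTest.kDefaultTimeoutMS, 50 /* ms */);

// getPrimary() polls internally; shortening its retry interval lets the
// fixture learn of the new primary promptly.
const newPrimary = rollbackTest.getTestFixture().getPrimary();
{code}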
kSteadyStateOps

The 'kSteadyStateOps' transition is slow for a few different reasons. First, consider the block of code that waits for the rollback to complete; in the baseline run, this section took 1456ms. We fail many iterations of the loop that waits for rollback to complete, and the rollback itself starts towards the end of this section, so it is not the rollback that is slow.
Instead, it takes a long time for the rollback node to find the sync source it is going to roll back against after we reconnect it. This is a byproduct of the various 1-second sleeps in bgsync.cpp that limit how quickly a node can re-select a sync source, and it can be resolved by using the 'forceBgSyncSyncSourceRetryWaitMS' failpoint to make those sleeps much shorter, as sketched below. The time it takes the fixture to detect rollback completion is also limited by the retry interval of the assert.soonNoExcept loop, so that retry interval can be shortened as well.
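A minimal sketch of enabling that failpoint on the rollback node (the 'sleepMS' data field and the 50ms value are my reading of bgsync.cpp and should be treated as illustrative):

{code:javascript}
// Shorten the ~1 second bgsync sync-source retry sleeps on the rollback node
// before reconnecting it, so it re-selects a sync source quickly.
assert.commandWorked(rollbackNode.adminCommand({
    configureFailPoint: "forceBgSyncSyncSourceRetryWaitMS",
    mode: "alwaysOn",
    data: {sleepMS: 50},  // assumed field name; default sleep is ~1000ms
}));
{code}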
After rollback has completed, there is another slow section in which we reconnect the tiebreaker and wait for it to replicate the latest ops; this takes 627ms from the time we reconnect the node to the time we finish waiting for replication. Once again, this is slow because of how long it takes the tiebreaker node to re-establish a sync source after being reconnected, which appears to be caused by the 1-second sleep triggered inside the stopReplProducer failpoint: after we disable the failpoint, it may take up to a second to move past that sleep. Furthermore, the awaitReplication retry interval limits how quickly the fixture can detect newly replicated ops, so its retry interval should be shortened too. After we wait for replication on the tiebreaker, we run data consistency checks, which may take a non-negligible amount of time, i.e. several hundred milliseconds; we are not trying to speed up the consistency checks themselves in this ticket, so we leave those as is. Finally, we call stopServerReplication on the tiebreaker node as the last step of the transition to kSteadyStateOps. The time between actually re-enabling the failpoint and completing the transition is around 303ms, which appears to be a consequence of the 300ms retry interval used by checkLog.contains; that interval can be reduced to detect the new failpoint more quickly, as in the sketch below.
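For that last point, a hedged sketch of polling the log on a tighter interval than checkLog.contains' 300ms default (the exact log message text, the tiebreakerNode handle, and the checkLog.checkContainsOnce helper are assumptions based on current jstest helpers):

{code:javascript}
// Detect the re-enabled stopReplProducer failpoint on the tiebreaker with a
// 100ms retry interval instead of checkLog.contains' default 300ms.
assert.soon(
    () => checkLog.checkContainsOnce(tiebreakerNode,
                                     "stopReplProducer fail point is enabled"),
    "tiebreaker never hit the stopReplProducer failpoint",
    ReplSetTest.kDefaultTimeoutMS,
    100 /* ms between retries */);
{code}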
| Comment by William Schultz (Inactive) [ 17/Oct/19 ] |

For some crude metrics on this, here are filtered logs showing all "transition to ROLLBACK from SECONDARY" and "transition to SECONDARY from ROLLBACK" log messages in an Evergreen run of replica_sets on an enterprise-rhel-62-64-bit variant: rollbacks.txt