[SERVER-44068] Investigate the slowest sections of RollbackTest and remove any wasted downtime Created: 17/Oct/19  Updated: 29/Oct/23  Resolved: 27/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Task Priority: Minor - P4
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File after_stats_kSteadyStateOps.png     PNG File after_stats_kSyncSourceOpsBeforeRollback.png     PNG File before_stats_kSteadyStateOps.png     PNG File before_stats_kSyncSourceOpsBeforeRollback.png     Text File rollbacks.txt     Text File rollbacks_with_rollback_test.txt    
Issue Links:
Backports
Duplicate
is duplicated by SERVER-45627 Investigate slowest sections of Rollb... Closed
Related
is related to SERVER-47702 Tests should not try to resume index ... Closed
is related to SERVER-47767 read_operations_during_rollback.js sh... Closed
is related to SERVER-47814 Tests that pass a custom ReplSetTest ... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Repl 2020-04-20, Repl 2020-05-04
Participants:

 Description   

We should investigate the slowest parts of the RollbackTest fixture to identify and remove any areas of wasted downtime. Wasted downtime is any time spent not doing useful, required work: for example, waiting for data to propagate over the network, or waiting for an arbitrary timeout to fire in the server (e.g. a heartbeat interval or election timeout).

In many of the tests that use the rollback test fixture, the rollback itself takes only a few hundred milliseconds. In theory, our test fixtures should be able to run a rollback test sequence in not much more time than it takes to complete the rollback itself, setting aside required setup and teardown time. We should investigate whether this is feasible with our current testing infrastructure.



 Comments   
Comment by William Schultz (Inactive) [ 27/Apr/20 ]

I think we may want to backport the 'restartHeartbeats' command (i.e. this commit) to 4.4 so that we don't run into multiversion failures when we re-enable multiversion suites, which are currently disabled (see SERVER-47083).

Comment by William Schultz (Inactive) [ 27/Apr/20 ]

After the final changes have been pushed, we can see the overall improvements from this ticket by measuring the durations of the transitions inside the rollback_test_control.js test on my local workstation, built with --opt=on --dbg=on:

Before

[js_test:rollback_test_control] 2020-04-09T20:41:00.516+0000 [RollbackTest] RollbackTest transition to kRollbackOps took 14 ms
[js_test:rollback_test_control] 2020-04-09T20:41:01.349+0000 [RollbackTest] RollbackTest transition to kSyncSourceOpsBeforeRollback took 835 ms
[js_test:rollback_test_control] 2020-04-09T20:41:01.350+0000 [RollbackTest] RollbackTest transition to kSyncSourceOpsDuringRollback took 1 ms
[js_test:rollback_test_control] 2020-04-09T20:41:04.197+0000 [RollbackTest] RollbackTest transition to kSteadyStateOps took 2847 ms
[js_test:rollback_test_control] 2020-04-09T20:41:04.214+0000 [RollbackTest] RollbackTest transition to kStopped took 16 ms

After

[js_test:rollback_test_control] 2020-04-27T17:06:32.471-0400 [RollbackTest] RollbackTest transition to kRollbackOps took 19 ms
[js_test:rollback_test_control] 2020-04-27T17:06:32.572-0400 [RollbackTest] RollbackTest transition to kSyncSourceOpsBeforeRollback took 101 ms
[js_test:rollback_test_control] 2020-04-27T17:06:32.574-0400 [RollbackTest] RollbackTest transition to kSyncSourceOpsDuringRollback took 2 ms
[js_test:rollback_test_control] 2020-04-27T17:06:33.448-0400 [RollbackTest] RollbackTest transition to kSteadyStateOps took 874 ms
[js_test:rollback_test_control] 2020-04-27T17:06:33.465-0400 [RollbackTest] RollbackTest transition to kStopped took 17 ms

Comparing these two runs directly, this is a reduction of roughly 730 ms (835 ms → 101 ms) for the kSyncSourceOpsBeforeRollback transition and 1973 ms (2847 ms → 874 ms) for the kSteadyStateOps transition.

To understand these numbers more accurately, we can also look at the before/after distributions of durations for the transitions we improved. The numbers below are taken from 20 repeated runs of rollback_test_control.js on my local workstation, before and after the changes.

Transition                              Median       SD           95th pctile
kSyncSourceOpsBeforeRollback, Before    842.5 ms     305.3 ms     866.15 ms
kSyncSourceOpsBeforeRollback, After     65.5 ms      25.1 ms      108.8 ms
kSteadyStateOps, Before                 2975.5 ms    369.5 ms     3590.9 ms
kSteadyStateOps, After                  877.5 ms     103.4 ms     1143.45 ms
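
For reference, here is a minimal sketch of how these summary statistics might be computed from the per-run durations; the exact percentile definition used above is an assumption, and this is not the actual measurement script:

// Illustrative only (not the actual script used): compute median, sample
// standard deviation, and a linearly interpolated 95th percentile from an
// array of per-run transition durations, in milliseconds.
function summarize(durationsMs) {
    const xs = durationsMs.slice().sort((a, b) => a - b);
    const n = xs.length;

    const median = (xs[Math.floor((n - 1) / 2)] + xs[Math.ceil((n - 1) / 2)]) / 2;

    const mean = xs.reduce((sum, x) => sum + x, 0) / n;
    const sd = Math.sqrt(xs.reduce((sum, x) => sum + Math.pow(x - mean, 2), 0) / (n - 1));

    // Linear interpolation between the two nearest ranks (one of several common
    // percentile definitions).
    const pos = 0.95 * (n - 1);
    const lower = Math.floor(pos);
    const upper = Math.ceil(pos);
    const p95 = xs[lower] + (xs[upper] - xs[lower]) * (pos - lower);

    return {median: median, sd: sd, p95: p95};
}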

Comment by Githook User [ 27/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-44068 Speed up transition to 'kSteadyStateOps' in RollbackTest
Branch: master
https://github.com/mongodb/mongo/commit/a30cef00b9e3ccfe60a4dce7f9f2c7cbaf1e68b7

Comment by Githook User [ 14/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-44068 Speed up transition to 'kSyncSourceOpsBeforeRollback' in RollbackTest
Branch: master
https://github.com/mongodb/mongo/commit/d0b9ca3aa8015954b9d0fa4f2f14c142c32625f5

Comment by Githook User [ 14/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-44068 Add a 'restartHeartbeats' replication test command
Branch: master
https://github.com/mongodb/mongo/commit/a697d23b924a47084501caee38365970f89852d3

Comment by Githook User [ 13/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-44068 Add RollbackTest control test
Branch: master
https://github.com/mongodb/mongo/commit/60102e6a083fe3290e30bbad0c83c6293ba8b301

Comment by Githook User [ 13/Apr/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-44068 Add log messages to measure transition durations in RollbackTest
Branch: master
https://github.com/mongodb/mongo/commit/f72ac59661db885c5715a33f703076ae1a23f436

Comment by William Schultz (Inactive) [ 10/Apr/20 ]

This patch's goal is to remove any remaining speed bottlenecks in RollbackTest. I added log messages to measure the duration of each RollbackTest transition method, and added a control test that performs a "no-op" rollback sequence so that we can measure the overhead of the fixture accurately. My goal is to measure and optimize the time spent inside the transition methods, since that is the only part of a test that the RollbackTest fixture can control. If expensive or slow operations are run in between transitions, that will make the test slow, but it is not the fault of RollbackTest.
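
For context, here is a minimal, abbreviated sketch of what a "no-op" control sequence looks like with the RollbackTest fixture (illustrative only, not necessarily the exact contents of rollback_test_control.js):

// Drive the fixture through every transition without doing any work in between,
// so the elapsed time of each transition measures only fixture overhead.
(function() {
"use strict";

load("jstests/replsets/libs/rollback_test.js");

const rollbackTest = new RollbackTest("rollback_test_control");

// Each transition is timed via the fixture's "transition to ... took N ms" log lines.
rollbackTest.transitionToRollbackOperations();
rollbackTest.transitionToSyncSourceOperationsBeforeRollback();
rollbackTest.transitionToSyncSourceOperationsDuringRollback();
rollbackTest.transitionToSteadyStateOperations();

rollbackTest.stop();
})();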

The speed improvements focus on the slowest sections of RollbackTest. In a baseline run of rollback_test_control.js, before any improvements, the durations of the transition methods are as follows:

[js_test:rollback_test_control] 2020-04-09T20:41:00.516+0000 [RollbackTest] RollbackTest transition to kRollbackOps took 14 ms
[js_test:rollback_test_control] 2020-04-09T20:41:01.349+0000 [RollbackTest] RollbackTest transition to kSyncSourceOpsBeforeRollback took 835 ms
[js_test:rollback_test_control] 2020-04-09T20:41:01.350+0000 [RollbackTest] RollbackTest transition to kSyncSourceOpsDuringRollback took 1 ms
[js_test:rollback_test_control] 2020-04-09T20:41:04.197+0000 [RollbackTest] RollbackTest transition to kSteadyStateOps took 2847 ms
[js_test:rollback_test_control] 2020-04-09T20:41:04.214+0000 [RollbackTest] RollbackTest transition to kStopped took 16 ms

The 'kSyncSourceOpsBeforeRollback' transition and the 'kSteadyStateOps' transition are the slowest. We can look at each of them separately to understand why they are slow.

kSyncSourceOpsBeforeRollback

The 'kSyncSourceOpsBeforeRollback' transition is slow primarily because of the election that switches the primary from the rollback node to the sync source node. The sync source node repeatedly fails to run for election, even in response to explicit stepUp requests, because it cannot yet see a majority:

[js_test:rollback_test_control] 2020-04-09T20:41:00.534+0000 [RollbackTest] Waiting for the new primary ip-10-122-24-2:20271 to be elected
[js_test:rollback_test_control] 2020-04-09T20:41:00.534+0000 d20281| 2020-04-09T20:41:00.534+00:00 I  COMMAND  21581   [conn2] "Received replSetStepUp request"
[js_test:rollback_test_control] 2020-04-09T20:41:00.534+0000 d20281| 2020-04-09T20:41:00.534+00:00 I  REPL     21485   [conn2] "Canceling catchup takeover callback"
[js_test:rollback_test_control] 2020-04-09T20:41:00.534+0000 d20281| 2020-04-09T20:41:00.534+00:00 I  ELECTION 4615657 [conn2] "Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}
...
[js_test:rollback_test_control] 2020-04-09T20:41:00.746+0000 d20281| 2020-04-09T20:41:00.735+00:00 I  COMMAND  21581   [conn2] "Received replSetStepUp request"
[js_test:rollback_test_control] 2020-04-09T20:41:00.746+0000 d20281| 2020-04-09T20:41:00.735+00:00 I  ELECTION 4615657 [conn2] "Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}
...
[js_test:rollback_test_control] 2020-04-09T20:41:00.937+0000 d20281| 2020-04-09T20:41:00.936+00:00 I  COMMAND  21581   [conn2] "Received replSetStepUp request"
[js_test:rollback_test_control] 2020-04-09T20:41:00.937+0000 d20281| 2020-04-09T20:41:00.936+00:00 I  ELECTION 4615657 [conn2] "Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}

This continues until the sync source node (20281) finally receives a heartbeat from the tiebreaker node (20282), which lets it see a healthy majority. The next stepUp attempt then succeeds and the node gets elected:

[js_test:rollback_test_control] 2020-04-09T20:41:01.034+0000 d20281| 2020-04-09T20:41:01.033+00:00 D2 REPL_HB  4615620 [ReplCoord-0] "Received response to heartbeat","attr":{"requestId":46,"target":"ip-10-122-24-2:20272","response":{"ok":1.0,"state":2,"v":3,"configTerm":1,"set":"RollbackTest","syncingTo":"ip-10-122-24-2:20270","term":1,"primaryId":0,"durableOpTime":{"ts":{"$timestamp":{"t":1586464860,"i":2}},"t":1},"durableWallTime":{"$date":"2020-04-09T20:41:00.178Z"},"opTime":{"ts":{"$timestamp":{"t":1586464860,"i":2}},"t":1},"wallTime":{"$date":"2020-04-09T20:41:00.178Z"},"$replData":{"term":1,"lastOpCommitted":{"ts":{"$timestamp":{"t":1586464860,"i":2}},"t":1},"lastCommittedWall":{"$date":"2020-04-09T20:41:00.178Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1586464860,"i":2}},"t":1},"configVersion":3,"configTerm":1,"replicaSetId":{"$oid":"5e8f885b087643361a3fe7fb"},"primaryIndex":0,"syncSourceIndex":0,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1586464860,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1586464860,"i":2}}}}
...
[js_test:rollback_test_control] 2020-04-09T20:41:01.138+0000 d20281| 2020-04-09T20:41:01.138+00:00 I  COMMAND  21581   [conn2] "Received replSetStepUp request"
[js_test:rollback_test_control] 2020-04-09T20:41:01.138+0000 d20281| 2020-04-09T20:41:01.138+00:00 I  ELECTION 4615661 [conn2] "Starting an election due to step up request"
...
[js_test:rollback_test_control] 2020-04-09T20:41:01.139+0000 d20281| 2020-04-09T20:41:01.139+00:00 I  ELECTION 21444   [ReplCoord-5] "Dry election run succeeded, running for election","attr":{"newTerm":2}
...
[js_test:rollback_test_control] 2020-04-09T20:41:01.156+0000 d20281| 2020-04-09T20:41:01.155+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
...
[js_test:rollback_test_control] 2020-04-09T20:41:01.349+0000 [RollbackTest] Elected the old secondary ip-10-122-24-2:20271 as the new primary

Even after the transition to primary completes, however, the test fixture does not learn of the newly elected primary for another ~200ms, which appears to be due to the default retry interval of the getPrimary call. To resolve this slowness, we want to trigger an immediate round of heartbeats after reconnecting the sync source node, and also ensure that the test fixture promptly elects a primary and learns of it, by reducing the retry interval of these replSetStepUp attempts and of this getPrimary call.
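
Roughly, the shape of that change as a sketch (the {replSetTest: 1, restartHeartbeats: true} spelling, the helper name, and the 25 ms retry interval are assumptions for illustration, not necessarily what the final patch does; 'syncSource' is the fixture's connection to the sync source node):

function stepUpSyncSourceQuickly(syncSource, retryIntervalMS = 25) {
    // Assumed spelling of the new test-only command: restart the heartbeat
    // schedule so the reconnected node immediately learns about the healthy
    // majority instead of waiting for the next scheduled heartbeat.
    assert.commandWorked(syncSource.adminCommand({replSetTest: 1, restartHeartbeats: true}));

    // Retry replSetStepUp on a short interval until the node is electable and wins.
    assert.soonNoExcept(() => {
        return syncSource.adminCommand({replSetStepUp: 1}).ok === 1;
    }, "sync source failed to step up", ReplSetTest.kDefaultTimeoutMS, retryIntervalMS);
}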

kSteadyStateOps

The 'kSteadyStateOps' transition is slow for a few different reasons. First, we focus on the block of code that waits for the rollback to complete. In the baseline run, this section took 1456ms:

[js_test:rollback_test_control] 2020-04-09T20:41:01.350+0000 [RollbackTest] Wait for ip-10-122-24-2:20270 to finish rollback
[js_test:rollback_test_control] 2020-04-09T20:41:02.806+0000 [jsTest] [RollbackTest] Rollback on ip-10-122-24-2:20270 completed, reconnecting tiebreaker

Many iterations of the loop that waits for the rollback to complete fail, and the rollback only starts towards the end of this section, so it is not the rollback itself that is slow:

[js_test:rollback_test_control] 2020-04-09T20:41:01.350+0000 [RollbackTest] Reconnecting the secondary ip-10-122-24-2:20270 so it'll go into rollback
[js_test:rollback_test_control] 2020-04-09T20:41:01.350+0000 [RollbackTest] Wait for ip-10-122-24-2:20270 to finish rollback
[js_test:rollback_test_control] 2020-04-09T20:41:01.354+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.354+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.566+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.566+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.766+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.766+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.971+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:01.972+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:02.177+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:02.177+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:02.382+0000 assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:02.383+0000 assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-24-2:20270
[js_test:rollback_test_control] 2020-04-09T20:41:02.533+0000 d20280| 2020-04-09T20:41:02.533+00:00 I  REPL     21088   [BackgroundSync] "Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"ip-10-122-24-2:20271"}
[js_test:rollback_test_control] 2020-04-09T20:41:02.536+0000 d20280| 2020-04-09T20:41:02.536+00:00 I  ROLLBACK 21593   [BackgroundSync] "Transition to ROLLBACK"
[js_test:rollback_test_control] 2020-04-09T20:41:02.607+0000 d20280| 2020-04-09T20:41:02.607+00:00 I  REPL     21088   [BackgroundSync] "Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"ip-10-122-24-2:20271"}
[js_test:rollback_test_control] 2020-04-09T20:41:02.806+0000 [jsTest] [RollbackTest] Rollback on ip-10-122-24-2:20270 completed, reconnecting tiebreaker

Instead, it takes a long time for the rollback node to find the sync source it will roll back against after we reconnect it. This is a byproduct of the various 1-second sleeps in bgsync.cpp that limit how quickly a sync source can be re-selected. It can be resolved by using the 'forceBgSyncSyncSourceRetryWaitMS' failpoint to make these sleeps much shorter. The time it takes for the fixture to detect rollback completion is also limited by the retry interval of the assert.soonNoExcept loop, so those retry intervals can be shortened as well.
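
A sketch of both adjustments (the failpoint's 'sleepMS' data field, the helper name, and the retry interval are assumptions for illustration, not the final patch):

function waitForRollbackQuickly(rollbackNode, lastRBID, retryIntervalMS = 25) {
    // Assumed data field name: shrink bgsync's 1-second retry sleeps so the node
    // re-selects a sync source almost immediately after being reconnected.
    assert.commandWorked(rollbackNode.adminCommand({
        configureFailPoint: "forceBgSyncSyncSourceRetryWaitMS",
        mode: "alwaysOn",
        data: {sleepMS: 1},
    }));

    // Poll the rollback id (RBID) on a short interval; it increments once the
    // rollback has completed.
    assert.soonNoExcept(() => {
        const res = assert.commandWorked(rollbackNode.adminCommand({replSetGetRBID: 1}));
        return res.rbid > lastRBID;
    }, "expected RBID to increment on the rollback node",
       ReplSetTest.kDefaultTimeoutMS, retryIntervalMS);
}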

After rollback has completed, there is another slow section where we reconnect the tiebreaker and wait for it to replicate the latest ops, which takes 627ms from the time we reconnect the node to the time we finish waiting for replication:

[js_test:rollback_test_control] 2020-04-09T20:41:02.806+0000 [jsTest] [RollbackTest] Rollback on ip-10-122-24-2:20270 completed, reconnecting tiebreaker
[js_test:rollback_test_control] 2020-04-09T20:41:03.247+0000 d20282| 2020-04-09T20:41:03.247+00:00 I  REPL     21088   [BackgroundSync] "Changed sync source","attr":{"oldSyncSource":"ip-10-122-24-2:20270","newSyncSource":"ip-10-122-24-2:20271"}
[js_test:rollback_test_control] 2020-04-09T20:41:03.433+0000 [jsTest] [RollbackTest] awaitReplication completed

Once again, this is slow because of how long it takes the tiebreaker node to re-establish a sync source after being reconnected, which appears to be caused by the 1-second sleep triggered inside the stopReplProducer failpoint: after we disable the failpoint, it may take up to a second to move past it. Furthermore, the awaitReplication retry interval limits how quickly the fixture can detect newly replicated ops, so that retry interval should be shortened as well.
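
A sketch of the corresponding change (the helper name, the timestamp-only optime comparison, and the retry interval are illustrative assumptions; 'tiebreaker' is the fixture's connection to the tiebreaker node and 'targetOpTime' is the optime it should reach):

function resumeTiebreakerAndAwaitOpTime(tiebreaker, targetOpTime, retryIntervalMS = 25) {
    // Turn the oplog fetcher back on; the producer may still sleep up to ~1 s
    // inside the failpoint before it notices it was disabled.
    assert.commandWorked(tiebreaker.adminCommand(
        {configureFailPoint: "stopReplProducer", mode: "off"}));

    // Poll the node's applied optime on a short interval instead of relying on
    // awaitReplication's default retry cadence. For simplicity this compares
    // timestamps only.
    assert.soon(() => {
        const status = assert.commandWorked(tiebreaker.adminCommand({replSetGetStatus: 1}));
        const appliedTs = status.optimes.appliedOpTime.ts;
        return bsonWoCompare({ts: appliedTs}, {ts: targetOpTime.ts}) >= 0;
    }, "tiebreaker did not replicate the latest ops",
       ReplSetTest.kDefaultTimeoutMS, retryIntervalMS);
}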

After we wait for replication on the tiebreaker, we run data consistency checks, which can take a non-negligible amount of time (several hundred milliseconds). Speeding up the consistency checks themselves is out of scope for this ticket, so we leave those as is.

Finally, we call stopServerReplication on the tiebreaker node as the last step of the transition to kSteadyStateOps. The time between actually re-enabling the failpoint and completing the transition is around 303 ms, which appears to be a consequence of the 300 ms retry interval used by checkLog.contains. That interval can be reduced to detect the new failpoint more quickly (see the sketch after the log lines below).

[js_test:rollback_test_control] 2020-04-09T20:41:03.894+0000 d20282| 2020-04-09T20:41:03.893+00:00 W  CONTROL  23829   [conn2] "Set failpoint","attr":{"failPointName":"stopReplProducer","failPoint":{"mode":1,"data":{},"timesEntered":4}}
[js_test:rollback_test_control] 2020-04-09T20:41:04.197+0000 [RollbackTest] RollbackTest transition to kSteadyStateOps took 2847 ms
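
A sketch of one way to do that, assuming checkLog's one-shot checkContainsOnce() helper (the helper name defined here and the 50 ms interval are illustrative):

function waitForStopReplProducerLogLine(tiebreaker, retryIntervalMS = 50) {
    // Poll the global log on a short interval rather than relying on
    // checkLog.contains() and its fixed 300 ms retry interval. Matching on
    // "stopReplProducer" is approximate and would also match other mentions
    // of the failpoint.
    assert.soon(() => checkLog.checkContainsOnce(tiebreaker, "stopReplProducer"),
                "did not see the stopReplProducer 'Set failpoint' log line",
                ReplSetTest.kDefaultTimeoutMS,
                retryIntervalMS);
}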

Comment by William Schultz (Inactive) [ 17/Oct/19 ]

For some crude metrics on this, here are filtered logs showing all "transition to ROLLBACK from SECONDARY" and "transition to SECONDARY from ROLLBACK" log messages in an Evergreen run of the replica_sets suite on an enterprise-rhel-62-64-bit variant: rollbacks.txt. And, restricting to only the tests that actually use RollbackTest: rollbacks_with_rollback_test.txt
