[SERVER-45400] Reduce OplogFetcher getMore awaitData timeout for nodes in RollbackTest fixture Created: 07/Jan/20  Updated: 29/Oct/23  Resolved: 14/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Improvement Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Ryan Timmons
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
is caused by SERVER-37390 RollbackTestFixture doesn't need to w... Closed
Related
related to SERVER-41780 prepareTransaction doesn't always wai... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2, v4.0
Sprint: Repl 2020-01-27
Participants:

 Description   

SERVER-37390 made it so that the RollbackTest fixture uses a high election timeout (24 hours) to prevent unplanned state transitions during test execution. This has the unintended side effect of slowing down tests that use this fixture since the awaitData timeout on getMores sent by the OplogFetcher are affected by the election timeout. A 24 hour election timeout causes us to use the maximum awaitData timeout which is 30 seconds. When we stop replication throughout the RollbackTest, this can cause delays when we wait for the stopReplProducer failpoint to be hit, since a getMore doesn't time out and return to the syncing node for 30 seconds even if there is no data to fetch. We can reduce this timeout inside RollbackTest by using this existing failpoint.



 Comments   
Comment by Githook User [ 14/Jan/20 ]

Author:

{'name': 'Ryan Timmons', 'email': 'ryan.timmons@mongodb.com', 'username': 'rtimmons'}

Message: SERVER-45400 Reduce OplogFetcher getMore awaitData timeout for nodes in RollbackTest
Branch: master
https://github.com/mongodb/mongo/commit/e1b4db12bbe172da6381ff881305752faad1c740

Comment by Judah Schvimer [ 07/Jan/20 ]

Bumping up to "major" since I think this is an important improvement.

Comment by William Schultz (Inactive) [ 07/Jan/20 ]

Here is a filtered log from a recent execution of jstests/replsets/rollback_rename_collection_on_sync_source.js where we see two delays >20 seconds due to the high getMore timeout:

[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:53.508-0500 ReplSetTest startSet took 10775ms for 3 nodes.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:54.005-0500 ReplSetTest initiate command took 345ms for 3 nodes in set 'RollbackTest'
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:54.978-0500 ReplSetTest initiate reconfig and awaitSecondaryNodes took 700ms for 3 nodes in set 'RollbackTest'
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:56.176-0500 ReplSetTest initiate awaitLastStableRecoveryTimestamp took 498ms for 3 nodes in set 'RollbackTest'
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:56.654-0500 ReplSetTest initiateWithAnyNodeAsPrimary took 3138ms for 3 nodes.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:57.605-0500 ReplSetTest initiateWithNodeZeroAsPrimary took 4093ms for 3 nodes.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:58.224-0500 [jsTest] Stopping replication on williams-ubuntu:34022
 
// 30 second delay.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:07:58.272-0500 d34032| 2020-01-07T15:07:58.272-0500 W  CONTROL  [conn2] failpoint: stopReplProducer set to: { mode: 1, data: {}, timesEntered: 0 }
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:28.213-0500 d34032| 2020-01-07T15:08:28.213-0500 I  REPL     [BackgroundSync] bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.
 
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:28.959-0500 [jsTest] [RollbackTest] Transitioning to: "kRollbackOps"
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:28.959-0500 [RollbackTest] Isolating the secondary williams-ubuntu:34021 from the tiebreaker
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:28.970-0500 [RollbackTest] Isolating the primary williams-ubuntu:34020 from the secondary williams-ubuntu:34021
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:29.033-0500 [jsTest] [RollbackTest] Transitioning to: "kSyncSourceOpsBeforeRollback"
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:29.080-0500 [RollbackTest] Isolating the primary williams-ubuntu:34020 so it will step down
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:29.214-0500 d34032| 2020-01-07T15:08:29.214-0500 I  REPL     [BackgroundSync] bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:30.215-0500 d34032| 2020-01-07T15:08:30.214-0500 I  REPL     [BackgroundSync] bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:30.716-0500 [jsTest] [RollbackTest] Transitioning to: "kSyncSourceOpsDuringRollback"
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:31.215-0500 d34032| 2020-01-07T15:08:31.215-0500 I  REPL     [BackgroundSync] bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:32.202-0500 d34032| 2020-01-07T15:08:32.202-0500 W  CONTROL  [conn2] failpoint: stopReplProducer set to: { mode: 0, data: {}, timesEntered: 5 }
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:32.854-0500 [jsTest] [RollbackTest] Transitioning to: "kSteadyStateOps"
 
// 25 second delay
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:08:47.485-0500 d34032| 2020-01-07T15:08:47.485-0500 W  CONTROL  [conn2] failpoint: stopReplProducer set to: { mode: 1, data: {}, timesEntered: 5 }
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:09:12.298-0500 d34032| 2020-01-07T15:09:12.298-0500 I  REPL     [BackgroundSync] bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.
 
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:09:12.539-0500 d34032| 2020-01-07T15:09:12.539-0500 W  CONTROL  [conn2] failpoint: stopReplProducer set to: { mode: 0, data: {}, timesEntered: 7 }
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:09:27.041-0500 [jsTest] [RollbackTest] Transitioning to: "kStopped"
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:09:31.309-0500 ReplSetTest stopSet data consistency checks finished, took 4264ms for 3 nodes.
[js_test:rollback_rename_collection_on_sync_source] 2020-01-07T15:09:35.121-0500 ReplSetTest stopSet stopped all replica set nodes, took 3675ms for 3 nodes.
[executor:js_test:job0] 2020-01-07T15:09:35.275-0500 rollback_rename_collection_on_sync_source.js ran in 115.53 seconds: no failures detected.

This is an example of another, non rollback JS test that points out the same issue.

Generated at Thu Feb 08 05:08:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.