[SERVER-48214] ReplSetTest awaitReplication is too verbose Created: 14/May/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Eric Milkie Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

While awaitReplication is looping, it logs messages every 2 tenths of a second. In some cases, it can log 9 or more lines in each batch of messages. For some tests, especially when they are failing, this can overwhelm the log and cause lengthy load times and web browser memory issues, making the BF diagnosis experience worse than it needs to be. I think some of these messages can be eliminated or combined while still giving all the information.

Example:

 
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.896+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: checking secondaries against latest primary optime { "ts" : Timestamp(1589441508, 23192), "t" : NumberLong(1) }
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.897+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: checking secondary #0: ip-10-122-50-53:20271
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.898+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: optime for secondary #0, ip-10-122-50-53:20271, is { "ts" : Timestamp(1589441505, 17394), "t" : NumberLong(1) } but latest is { "ts" : Timestamp(1589441508, 23192), "t" : NumberLong(1) }
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.899+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: secondary #0, ip-10-122-50-53:20271, is NOT synced
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.899+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: secondary #0, ip-10-122-50-53:20271, has made progress. Resetting awaitReplication timeout
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.899+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: checking secondaries against latest primary optime { "ts" : Timestamp(1589441508, 23192), "t" : NumberLong(1) }
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.900+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: checking secondary #0: ip-10-122-50-53:20271
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.901+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: optime for secondary #0, ip-10-122-50-53:20271, is { "ts" : Timestamp(1589441505, 17394), "t" : NumberLong(1) } but latest is { "ts" : Timestamp(1589441508, 23192), "t" : NumberLong(1) }
[js_test:backup_restore_stop_start] 2020-05-14T07:31:48.901+0000 sh9428| [WaitForReplication:job0] ReplSetTest awaitReplication: secondary #0, ip-10-122-50-53:20271, is NOT synced



 Comments   
Comment by Eric Milkie [ 10/Jul/20 ]

I think that solution would slow down the test suite, so I'd rather avoid that. Can we eliminate extraneous messages instead?

For example, this message is useless since the info is already gleanable in other messages:
secondary #0, ip-10-122-50-53:20271, has made progress. Resetting awaitReplication timeout

This message doesn't need to be on a separate line:
secondary #0, ip-10-122-50-53:20271, is NOT synced

This message has little utility and should be removed, since the following line makes it obvious what was being checked:
checking secondary #0: ip-10-122-50-53:20271

Comment by Judah Schvimer [ 10/Jul/20 ]

milkie, would assert.soon having an exponential backoff be an acceptable solution?

Generated at Thu Feb 08 05:16:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.