[SERVER-76695] initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point
Created: 28/Apr/23  Updated: 29/Oct/23  Resolved: 01/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 4.4.22, 5.0.18, 7.0.0-rc1, 6.0.7

Type: Bug
Priority: Major - P3
Reporter: Kaitlin Mahar
Assignee: Kaitlin Mahar
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.0, v6.0, v5.0, v4.4
Sprint: Repl 2023-05-01, Repl 2023-05-15
Participants:
Linked BF Score: 35

 Description   

Currently, this test expects that the OplogFetcher will always advance the latest commit point on a node during initial sync, and that by the time initial sync completes, the new node will have a non-null latest commit point. However, initial sync can complete without this happening. Per vishnu.kaushik@mongodb.com:

Every time the OplogFetcher successfully fetches a batch, it processes the metadata it receives with the batch, which includes attempting to advance the commit timestamp. The commit timestamp is only advanced when the term of the last applied timestamp, pushed forward by the applier, is equal to the term of the newly processed timestamp (roughly speaking). Consequently, the first time the OplogFetcher fetches a batch, the commit timestamp may not be pushed forward if nothing has been applied yet, i.e. if the last applied timestamp is uninitialized (ts: Timestamp(0, 0), t: -1), then its term, -1, won't be equal to the new term.

Soon enough, the applier will apply entries and when the OplogFetcher processes the next batch, it ends up advancing the commit point.
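
The term check described above can be sketched as a small model. This is only an illustration of the "roughly speaking" rule, not the server's actual logic: the helper name tryAdvanceCommitPoint, the syncNode object, and the simplified {ts, t} optime shape are all assumptions made for the example.

```javascript
// Simplified model of the rule described above; not the server's actual code.
// OpTimes are modeled as {ts: <number>, t: <term>} for brevity.
const UNINITIALIZED = {ts: 0, t: -1};

// Hypothetical helper: advance the commit point only when the term of
// lastApplied matches the term of the candidate commit point.
function tryAdvanceCommitPoint(node, candidate) {
    if (node.lastApplied.t !== candidate.t) {
        return false;  // terms differ, so the commit point is left alone
    }
    node.commitPoint = candidate;
    return true;
}

const syncNode = {lastApplied: UNINITIALIZED, commitPoint: UNINITIALIZED};

// First batch: nothing applied yet, so term -1 !== term 1 and nothing advances.
const advancedOnFirstBatch = tryAdvanceCommitPoint(syncNode, {ts: 10, t: 1});

// The applier makes progress, then the fetcher processes a second batch.
syncNode.lastApplied = {ts: 10, t: 1};
const advancedOnSecondBatch = tryAdvanceCommitPoint(syncNode, {ts: 10, t: 1});
```

In this model the commit point stays uninitialized after the first batch and only moves once a batch is processed after the applier has advanced lastApplied, which is the window the test can miss.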

Finally, once all entries up to the given point are applied, the applier exits and the InitialSyncer shuts the OplogFetcher down. The test uses the initialSyncHangBeforeCompletingOplogFetching failpoint to pause at this point: after applying is complete, but before the OplogFetcher is told to shut down.

Reaching this failpoint doesn't mean that the OplogFetcher has advanced the commit point, since that depends on the OplogFetcher having processed metadata after entries have been applied. Reaching the failpoint only indicates that entries have been applied.

Turning off that failpoint will let the InitialSyncer shut the OplogFetcher down. And if we shut the OplogFetcher down before the OplogFetcher had time to fetch another batch and process it, then the commit point will not have been advanced at all. It will still be uninitialized.

The test tries to avoid this by performing another write on the cluster, so that the OplogFetcher has something to fetch. But if that write is performed after the stopTimestamp, then there is nothing stopping the applier from exiting and telling the InitialSyncer that it is alright to shut the OplogFetcher down.

And afterwards, initial sync completes, and we try to get a commit point from the initial syncing node, but its commit timestamp is uninitialized.

If the OplogFetcher doesn't actually advance the last commit point, the node will shortly thereafter advance its last commit point once it receives a heartbeat from the primary and sees that its lastApplied optime is the same as the primary's durableOpTime.
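
The heartbeat path described above can be modeled in the same simplified way. Again, this is a sketch under assumptions: processPrimaryHeartbeat, opTimesEqual, and the {ts, t} optime shape are hypothetical names for illustration, and the real heartbeat handling in the server involves more conditions than this.

```javascript
// Simplified model of the heartbeat path; not the server's actual code.
// OpTimes are modeled as {ts: <number>, t: <term>} for brevity.
function opTimesEqual(a, b) {
    return a.ts === b.ts && a.t === b.t;
}

// Hypothetical helper: on a heartbeat from the primary, advance the commit
// point if this node's lastApplied matches the primary's durableOpTime.
function processPrimaryHeartbeat(member, primaryDurableOpTime) {
    if (!opTimesEqual(member.lastApplied, primaryDurableOpTime)) {
        return false;
    }
    member.commitPoint = primaryDurableOpTime;
    return true;
}

// A node that finished initial sync with an uninitialized commit point.
const newMember = {lastApplied: {ts: 10, t: 1}, commitPoint: {ts: 0, t: -1}};
const advancedByHeartbeat = processPrimaryHeartbeat(newMember, {ts: 10, t: 1});
```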

Using assert.soon here will allow the test to pass in the rare case where the commit point is advanced slightly later via heartbeat, and will allow the test to proceed with asserting that the initialSyncNode gossips the last commit point to the other secondary.
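
To illustrate why polling fixes the race, here is a minimal sketch that runs outside the mongo shell. The soon function is a stand-in for the shell's assert.soon that counts attempts instead of waiting on a timeout, and makeFakeInitialSyncNode and getLastCommittedOpTime are hypothetical names; none of this is the actual change made to the test.

```javascript
// Stand-in for the mongo shell's assert.soon so this sketch runs anywhere.
// It polls by attempt count rather than wall-clock time (a simplification).
function soon(predicate, msg, maxAttempts = 50) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        if (predicate()) {
            return attempt;
        }
    }
    throw new Error("assert.soon failed: " + msg);
}

// Hypothetical node model: the commit point stays uninitialized until a
// simulated heartbeat from the primary arrives on a given poll.
function makeFakeInitialSyncNode(heartbeatArrivesOnPoll) {
    let polls = 0;
    return {
        getLastCommittedOpTime() {
            polls++;
            return polls >= heartbeatArrivesOnPoll ? {ts: 10, t: 1} : {ts: 0, t: -1};
        },
    };
}

function isInitialized(opTime) {
    return opTime.t !== -1;
}

// A single immediate check would see an uninitialized commit point here;
// polling succeeds once the heartbeat has arrived.
const fakeNode = makeFakeInitialSyncNode(3);
const attemptsNeeded = soon(() => isInitialized(fakeNode.getLastCommittedOpTime()),
                            "commit point never advanced");
```

A one-shot assertion corresponds to maxAttempts = 1 in this model, which fails whenever the commit point is advanced via heartbeat rather than by the OplogFetcher.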



 Comments   
Comment by Githook User [ 02/May/23 ]

Author:

{'name': 'Kaitlin Mahar', 'email': 'kaitlin.mahar@mongodb.com', 'username': 'kmahar'}

Message: SERVER-76695 initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

(cherry picked from commit 2595c52e7ebbe72a41d541214f60828dcdf0d18b)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d54a2e5c1c09bd020d6c386cc18b499fa74b7b9f

Comment by Githook User [ 02/May/23 ]

Author:

{'name': 'Kaitlin Mahar', 'email': 'kaitlin.mahar@mongodb.com', 'username': 'kmahar'}

Message: SERVER-76695 initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

(cherry picked from commit 2595c52e7ebbe72a41d541214f60828dcdf0d18b)
Branch: v4.4
https://github.com/mongodb/mongo/commit/1cfc3c9fb940b8f26d376bc1481ad983cc9cad8f

Comment by Githook User [ 02/May/23 ]

Author:

{'name': 'Kaitlin Mahar', 'email': 'kaitlin.mahar@mongodb.com', 'username': 'kmahar'}

Message: SERVER-76695 initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

(cherry picked from commit 30d3ac6fac2b5e5f82520ba3ea68f1bc04d19b70)
Branch: v6.0
https://github.com/mongodb/mongo/commit/2595c52e7ebbe72a41d541214f60828dcdf0d18b

Comment by Githook User [ 02/May/23 ]

Author:

{'name': 'Kaitlin Mahar', 'email': 'kaitlin.mahar@mongodb.com', 'username': 'kmahar'}

Message: SERVER-76695 initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

(cherry picked from commit 30d3ac6fac2b5e5f82520ba3ea68f1bc04d19b70)
Branch: v7.0
https://github.com/mongodb/mongo/commit/98e0f3c66aa337a26893f20bffa58d01f5a38942

Comment by Githook User [ 01/May/23 ]

Author:

{'name': 'Kaitlin Mahar', 'email': 'kaitlin.mahar@mongodb.com', 'username': 'kmahar'}

Message: SERVER-76695 initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point
Branch: master
https://github.com/mongodb/mongo/commit/30d3ac6fac2b5e5f82520ba3ea68f1bc04d19b70

Generated at Thu Feb 08 06:33:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.