Core Server / SERVER-76695

initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

    • Fully Compatible
    • ALL
    • v7.0, v6.0, v5.0, v4.4
    • Repl 2023-05-01, Repl 2023-05-15
    • 35

      Currently, this test expects that the OplogFetcher will always advance the latest commit point on a node during initial sync, and that by the time initial sync completes a new node will have a non-null latest commit point. However, it's possible for initial sync to complete without this happening. Per vishnu.kaushik@mongodb.com:

      Every time the OplogFetcher successfully fetches a batch, it processes the metadata it receives with the batch, which includes attempting to advance the commit timestamp. The commit timestamp is only advanced when the term of the last applied timestamp, pushed forward by the applier, is equal to the term of the newly processed timestamp (roughly speaking). Consequently, the first time the OplogFetcher fetches a batch, the commit timestamp may not be pushed forward if nothing has been applied yet; i.e., if the last applied timestamp is uninitialized (ts: Timestamp(0, 0), t: -1), then its term, -1, won't be equal to the new term.
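The advancement rule above can be sketched as follows. This is an illustrative model, not server code: the OpTime shape `{ts, t}` and the helper name `maybeAdvanceCommitPoint` are assumptions for illustration.

```javascript
// Illustrative sketch of the rule described above: the commit point only
// moves forward when the term of the node's last applied optime matches
// the term of the candidate commit point from the batch metadata.
const UNINITIALIZED = {ts: 0, t: -1};  // Timestamp(0, 0), term -1

function maybeAdvanceCommitPoint(lastApplied, current, candidate) {
    // Roughly: only advance once the applier has caught up into the
    // candidate's term; otherwise leave the commit point untouched.
    if (lastApplied.t !== candidate.t) {
        return current;
    }
    return candidate.ts > current.ts ? candidate : current;
}

// Before anything is applied, the last applied optime is uninitialized,
// so the first fetched batch cannot advance the commit point.
let commitPoint = UNINITIALIZED;
commitPoint = maybeAdvanceCommitPoint(UNINITIALIZED, commitPoint, {ts: 10, t: 1});
console.log(commitPoint.t);  // still -1: term mismatch (-1 vs 1)

// Once the applier has applied entries in term 1, the next batch advances it.
commitPoint = maybeAdvanceCommitPoint({ts: 10, t: 1}, commitPoint, {ts: 12, t: 1});
console.log(commitPoint.ts);  // 12
```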

      Soon enough, the applier will apply entries and when the OplogFetcher processes the next batch, it ends up advancing the commit point.

      Finally, once all entries up to the given point are applied, the applier exits, and then the InitialSyncer shuts the OplogFetcher down. The test uses the initialSyncHangBeforeCompletingOplogFetching failpoint to pause at this point: after applying is complete, but before the OplogFetcher is told to shut down.

      Reaching this failpoint doesn't mean that the OplogFetcher has advanced the commit point, since that depends on the OplogFetcher having processed metadata after entries have been applied. Reaching the failpoint only indicates that entries have been applied.
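The failpoint choreography described above looks roughly like this in jstests style. This is a non-runnable sketch: `configureFailPoint` is the helper from `jstests/libs/fail_point_util.js`, but `initialSyncNode` and the surrounding setup are placeholders standing in for the real test.

```javascript
// Sketch of the test's failpoint sequence (jstests style, not standalone).
const fp = configureFailPoint(initialSyncNode,
                              "initialSyncHangBeforeCompletingOplogFetching");

// ... initial sync runs; the node applies entries up to the stopTimestamp
// and then hangs at the failpoint ...
fp.wait();  // the applier is done, but the OplogFetcher is not yet shut down

// A write here gives the OplogFetcher another batch to fetch -- but only
// helps if the OplogFetcher gets to process it before shutdown.

fp.off();   // lets the InitialSyncer shut the OplogFetcher down
```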

      Turning off that failpoint lets the InitialSyncer shut the OplogFetcher down. And if we shut the OplogFetcher down before it has had time to fetch and process another batch, then the commit point will not have been advanced at all; it will still be uninitialized.

      The test tries to avoid this by performing another write on the cluster, so that the OplogFetcher has something to fetch. But if that write is performed after the stopTimestamp, then there is nothing stopping the applier from exiting and telling the InitialSyncer that it is alright to shut the OplogFetcher down.

      Afterwards, initial sync completes, and when the test tries to get a commit point from the initial syncing node, its commit timestamp is still uninitialized.

      If the OplogFetcher doesn't actually advance the last commit point, the node will shortly thereafter advance its last commit point once it receives a heartbeat from the primary and sees that its lastApplied optime is the same as the primary's durableOpTime.

      Using assert.soon here will allow the test to pass in the rare case where the commit point is advanced slightly later via heartbeat, and will allow the test to proceed with asserting that the initialSyncNode gossips the last commit point to the other secondary.
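The proposed fix could look roughly like the following. In the real test, `assert.soon` is the mongo shell helper that retries a predicate until it succeeds or times out; here a minimal retry loop `soon` stands in for it, and `getCommitPoint` is a hypothetical stand-in for reading the node's lastCommittedOpTime (e.g. via replSetGetStatus), so the snippet runs standalone.

```javascript
// Minimal self-contained sketch of the assert.soon-style fix.
function soon(pred, msg, maxRetries = 1000) {
    // Stand-in for the mongo shell's assert.soon: retry until the
    // predicate returns true, or fail with msg.
    for (let i = 0; i < maxRetries; i++) {
        if (pred()) return;
    }
    throw new Error(msg);
}

// Stand-in for reading the initial sync node's commit point: it starts
// uninitialized and is advanced only after a few polls, modeling the
// rare case where a heartbeat advances it slightly later.
let polls = 0;
let commitPoint = {ts: 0, t: -1};
function getCommitPoint() {
    if (++polls >= 3) {
        commitPoint = {ts: 20, t: 1};
    }
    return commitPoint;
}

// Instead of asserting once that the commit point is non-null, poll until
// a heartbeat (or a late OplogFetcher batch) has advanced it.
soon(() => getCommitPoint().t !== -1,
     "initial sync node never advanced its commit point");
console.log(getCommitPoint().ts);  // 20
```

The retry only changes when the test observes the commit point, not what it asserts, so the stronger check that the initialSyncNode gossips the commit point to the other secondary can still follow.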

            kaitlin.mahar@mongodb.com Kaitlin Mahar