Core Server / SERVER-76695

initial_sync_nodes_maintain_and_gossip_commit_point.js should wait longer for initial sync node to advance commit point

    • Fully Compatible
    • ALL
    • v7.0, v6.0, v5.0, v4.4
    • Repl 2023-05-01, Repl 2023-05-15
    • 35

      Currently, this test expects that the OplogFetcher will always advance the latest commit point on a node during initial sync, and that by the time initial sync completes a new node will have a non-null latest commit point. However, it's possible for initial sync to complete without this happening. Per vishnu.kaushik@mongodb.com:

      Every time the OplogFetcher successfully fetches a batch, it processes the metadata it receives with the batch, which includes attempting to advance the commit timestamp. The commit timestamp is only advanced when the term of the last applied timestamp, pushed forward by the applier, is equal to the term of the newly processed timestamp (roughly speaking). Consequently, the first time the OplogFetcher fetches a batch, the commit timestamp may not be pushed forward if nothing has been applied yet; i.e., if the last applied timestamp is uninitialized (ts: Timestamp(0, 0), t: -1), then its term, -1, won't be equal to the new term.
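The advancement rule above can be sketched as follows. This is an illustrative model, not server code: the OpTime shape `{ts, t}` and the helper name `maybeAdvanceCommitPoint` are assumptions for illustration.

```javascript
// Illustrative sketch of the rule described above: the commit point only
// moves forward when the term of the node's last applied optime matches
// the term of the candidate commit point from the batch metadata.
const UNINITIALIZED = {ts: 0, t: -1};  // Timestamp(0, 0), term -1

function maybeAdvanceCommitPoint(lastApplied, current, candidate) {
    // Roughly: only advance once the applier has caught up into the
    // candidate's term; otherwise leave the commit point untouched.
    if (lastApplied.t !== candidate.t) {
        return current;
    }
    return candidate.ts > current.ts ? candidate : current;
}

// Before anything is applied, the last applied optime is uninitialized,
// so the first fetched batch cannot advance the commit point.
let commitPoint = UNINITIALIZED;
commitPoint = maybeAdvanceCommitPoint(UNINITIALIZED, commitPoint, {ts: 10, t: 1});
console.log(commitPoint.t);  // still -1: term mismatch (-1 vs 1)

// Once the applier has applied entries in term 1, the next batch advances it.
commitPoint = maybeAdvanceCommitPoint({ts: 10, t: 1}, commitPoint, {ts: 12, t: 1});
console.log(commitPoint.ts);  // 12
```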

      Soon enough, the applier will apply entries and when the OplogFetcher processes the next batch, it ends up advancing the commit point.

      Finally, once all entries up to the given point are applied, the applier exits, and then the InitialSyncer shuts the OplogFetcher down. The test uses the initialSyncHangBeforeCompletingOplogFetching failpoint to pause at this point: after applying is complete, but before the OplogFetcher is told to shut down.

      Reaching this failpoint doesn't mean that the OplogFetcher has advanced the commit point, since that depends on the OplogFetcher having processed metadata after entries have been applied. Reaching the failpoint only indicates that entries have been applied.
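The failpoint choreography described above looks roughly like this in jstests style. This is a non-runnable sketch: `configureFailPoint` is the helper from `jstests/libs/fail_point_util.js`, but `initialSyncNode` and the surrounding setup are placeholders standing in for the real test.

```javascript
// Sketch of the test's failpoint sequence (jstests style, not standalone).
const fp = configureFailPoint(initialSyncNode,
                              "initialSyncHangBeforeCompletingOplogFetching");

// ... initial sync runs; the node applies entries up to the stopTimestamp
// and then hangs at the failpoint ...
fp.wait();  // the applier is done, but the OplogFetcher is not yet shut down

// A write here gives the OplogFetcher another batch to fetch -- but only
// helps if the OplogFetcher gets to process it before shutdown.

fp.off();   // lets the InitialSyncer shut the OplogFetcher down
```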

      Turning off that failpoint lets the InitialSyncer shut the OplogFetcher down. And if we shut the OplogFetcher down before it has had time to fetch and process another batch, then the commit point will not have been advanced at all; it will still be uninitialized.

      The test tries to avoid this by performing another write on the cluster, so that the OplogFetcher has something to fetch. But if that write is performed after the stopTimestamp, then there is nothing stopping the applier from exiting and telling the InitialSyncer that it is alright to shut the OplogFetcher down.

      Afterwards, initial sync completes, and when the test tries to get a commit point from the initial syncing node, its commit timestamp is still uninitialized.

      If the OplogFetcher doesn't actually advance the last commit point, the node will shortly thereafter advance its last commit point once it receives a heartbeat from the primary and sees that its lastApplied optime is the same as the primary's durableOpTime.

      Using assert.soon here will allow the test to pass in the rare case where the commit point is advanced slightly later via heartbeat, and will allow the test to proceed with asserting that the initialSyncNode gossips the last commit point to the other secondary.
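The proposed fix could look roughly like the following. In the real test, `assert.soon` is the mongo shell helper that retries a predicate until it succeeds or times out; here a minimal retry loop `soon` stands in for it, and `getCommitPoint` is a hypothetical stand-in for reading the node's lastCommittedOpTime (e.g. via replSetGetStatus), so the snippet runs standalone.

```javascript
// Minimal self-contained sketch of the assert.soon-style fix.
function soon(pred, msg, maxRetries = 1000) {
    // Stand-in for the mongo shell's assert.soon: retry until the
    // predicate returns true, or fail with msg.
    for (let i = 0; i < maxRetries; i++) {
        if (pred()) return;
    }
    throw new Error(msg);
}

// Stand-in for reading the initial sync node's commit point: it starts
// uninitialized and is advanced only after a few polls, modeling the
// rare case where a heartbeat advances it slightly later.
let polls = 0;
let commitPoint = {ts: 0, t: -1};
function getCommitPoint() {
    if (++polls >= 3) {
        commitPoint = {ts: 20, t: 1};
    }
    return commitPoint;
}

// Instead of asserting once that the commit point is non-null, poll until
// a heartbeat (or a late OplogFetcher batch) has advanced it.
soon(() => getCommitPoint().t !== -1,
     "initial sync node never advanced its commit point");
console.log(getCommitPoint().ts);  // 20
```

The retry only changes when the test observes the commit point, not what it asserts, so the stronger check that the initialSyncNode gossips the commit point to the other secondary can still follow.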

            kaitlin.mahar@mongodb.com Kaitlin Mahar