- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 4.4.7, 5.0.0-rc8
- Component/s: None
- Fully Compatible
- ALL
- v5.0, v4.4, v4.2, v4.0
- Repl 2021-08-09, Repl 2021-08-23, Repl 2021-09-06, Repl 2021-09-20, Repl 2021-10-04, Repl 2021-10-18
- 115
(10/4/21 discoverability update: We figured out this happens due to the relevant secondary read getting a readSource of lastApplied (per most other use cases). Making that an untimestamped read solves the problem.)
The calls to applyCommand_inlock and scheduleOplogWrites in secondary application are not atomic. As a result, when an initial syncing node chooses a secondary as its sync source, it can observe that a command such as drop has already been applied but miss the corresponding oplog entry when calculating the stopTimestamp.
The following can happen:
- The initial syncing node sees that the drop of collection foo has been applied on a secondary sync source, but the oplog entry has not been written yet. The collectionCloner stops with a NamespaceNotFound error, expecting the drop to be applied during the initial sync oplog application phase.
- The initial syncing node fetches the lastApplied of the sync source and sets the stopTimestamp to T.
- The sync source writes the oplog entry for the drop from the first step at timestamp T + 1.
- The initial syncing node reaches stopTimestamp T and transitions to secondary. It then applies the drop and crashes because the collection does not exist.
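The interleaving above can be replayed with a small toy model. This is only a sketch and not server code: the `SyncSource` class, its methods, the `run_scenario` helper, and the starting timestamp of 10 are all hypothetical names and values chosen for illustration. The one property it keeps from the description is that the catalog change and the oplog write happen as two separate steps.

```python
from dataclasses import dataclass, field


@dataclass
class SyncSource:
    """Toy secondary (hypothetical) whose catalog change and oplog write are separate steps."""
    collections: set = field(default_factory=lambda: {"foo"})
    oplog: list = field(default_factory=list)  # entries are (timestamp, op, namespace)
    last_applied: int = 10  # plays the role of T; the value is arbitrary

    def apply_drop(self, ns):
        # First half of the non-atomic pair: the drop becomes visible in the catalog.
        self.collections.discard(ns)

    def write_oplog_for_drop(self, ns):
        # Second half: the oplog entry is written later, at T + 1.
        self.last_applied += 1
        self.oplog.append((self.last_applied, "drop", ns))


def run_scenario():
    source = SyncSource()
    events = []

    # The drop is applied on the sync source, but no oplog entry exists yet.
    source.apply_drop("foo")

    # The cloner finds no collection, so nothing is cloned for foo.
    cloned = set()
    if "foo" in source.collections:
        cloned.add("foo")
    else:
        events.append("cloner: NamespaceNotFound for foo")

    # The stop timestamp is taken from lastApplied, which is still T.
    stop_timestamp = source.last_applied

    # Only now does the sync source write the oplog entry, at T + 1.
    source.write_oplog_for_drop("foo")

    # Oplog application during initial sync covers entries up to the stop timestamp.
    applied_during_sync = [e for e in source.oplog if e[0] <= stop_timestamp]

    # After the transition to secondary, the later entry hits a missing collection.
    for ts, op, ns in source.oplog:
        if ts > stop_timestamp and op == "drop" and ns not in cloned:
            events.append(f"secondary: drop of missing collection {ns} at ts {ts}")

    return stop_timestamp, applied_during_sync, events
```

Calling `run_scenario()` in this model shows the drop entry landing after the stop timestamp, so it is never covered by the initial sync oplog application phase and is first seen once the node is already a secondary.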