(10/4/21 discoverability update: We figured out that this happens because the relevant read on the secondary gets a readSource of lastApplied, as in most other use cases. Making that an untimestamped read solves the problem.)
The calls to applyCommand_inlock and scheduleOplogWrites in secondary application are not atomic. So when an initial syncing node chooses a secondary as its sync source, it can see that a command such as a drop has been applied, but miss the corresponding oplog entry when calculating the stopTimestamp.
The following can happen:
- Initial syncing node sees the drop on collection foo has been applied on a secondary sync source (but no oplog write yet). The collectionCloner will stop with NamespaceNotFound error, expecting us to apply the drop during the initial sync oplog application phase.
- Initial syncing node fetches the lastApplied of the sync source, setting the stopTimestamp to T.
- The sync source writes the oplog entry for the drop from the first step at timestamp T + 1.
- The initial syncing node reaches stopTimestamp T, transitions to secondary, applies the drop at T + 1, and crashes because the collection does not exist.
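
The interleaving above can be replayed deterministically with a toy model. This is only a sketch for illustration, not server code: the SyncSource class, its methods, and the starting timestamp of 10 are all hypothetical stand-ins, and the two-step drop merely mimics the non-atomic gap between applying the command and writing its oplog entry.

```python
class SyncSource:
    """Toy secondary whose catalog change and oplog write are separate steps."""

    def __init__(self, collections, last_applied):
        self.collections = set(collections)
        self.oplog = []  # list of (timestamp, op, namespace)
        self.last_applied = last_applied

    def apply_drop(self, ns):
        # Step A: the catalog change becomes visible first.
        self.collections.discard(ns)

    def write_oplog_for_drop(self, ns):
        # Step B: the oplog entry lands later and advances lastApplied.
        self.last_applied += 1
        self.oplog.append((self.last_applied, "drop", ns))


def run_race():
    source = SyncSource(collections={"foo"}, last_applied=10)
    events = []

    # The sync source has applied the drop but not yet written the oplog entry.
    source.apply_drop("foo")

    # The cloner finds the collection missing and skips it.
    cloned = set(source.collections)
    events.append("clone skipped foo" if "foo" not in cloned else "cloned foo")

    # stopTimestamp is read before the oplog write, so T = 10.
    stop_timestamp = source.last_applied

    # The oplog entry for the drop is written at T + 1.
    source.write_oplog_for_drop("foo")

    # Entries up to T belong to initial sync; anything later is applied
    # as a normal secondary, where a drop of a missing collection is fatal.
    for ts, op, ns in source.oplog:
        if ts <= stop_timestamp:
            continue
        if op == "drop" and ns not in cloned:
            events.append(f"crash: drop of missing collection {ns} at ts {ts}")
    return stop_timestamp, events
```

In this model the drop lands just past the stopTimestamp, so it is never covered by the initial sync oplog application phase that the cloner was counting on.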