See this doc for a more detailed description of how waiting for journaling works today.
Right now, for (implicit and explicit) j:1 writes we wait for durability on the current node (which also includes waiting for no oplog holes if voters=1) prior to waiting for replication. However, the mechanism used for waiting for replication is redundant with the mechanism for waiting for durability now that the repl and topo coordinators track the durable point for all nodes including themselves. We should therefore skip waiting for durability locally and just go straight to waiting for replication.
There are a few things to be careful with when doing this:
- We currently skip waiting for replication for w:1 writes because we aren't waiting for other nodes.
- We need to decide what semantics we want for j:1 writes. Currently a j:1 write must be durable on the current node before we confirm it. While we can preserve that behavior by changing what we consider ready to confirm in the awaitReplication checks, I'm not sure what we want here. At least for w:majority, j:1 writes, it seems reasonable to return once the write is durable on any set of nodes that constitutes a majority, even if that set doesn't include the current primary. For w:1, j:1 I could imagine someone expecting the write to be durable on the current node, but I don't think that is really meaningful in our model.
- We need to be careful if the command does any non-replicated writes either after replicated writes or if there are no replicated writes. The awaitReplication logic can only wait for a specific optime, but non-replicated writes may happen after the latest optime. We need to avoid reintroducing a variant of SERVER-81780.
- We need to ensure that something tells the JournalFlusher thread that it should run. That is currently handled by the logic to wait for local journaling, but if we skip that, it won't happen any more. I POCed kicking that thread from every WUOW::commit() right next to where we currently kick the OplogVisibilityThread. That should work fine as long as we carefully use atomics so that we only need to acquire the JournalFlusher's mutex once by a single thread each time it loops. Otherwise it risks becoming an additional contention point. I'm not sure if we want to do this on secondaries or just the primary.
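As a rough illustration of the "carefully use atomics" idea in the last bullet, here is a minimal sketch of a coalescing kick. `FlusherKick`, `requestFlush()`, `waitForKick()` and `writerLockCount()` are hypothetical names made up for this example, not the real JournalFlusher interface, and it assumes a single flusher thread. Only the writer that flips the flag from false to true takes the mutex, so the mutex is acquired by at most one writer per flusher loop.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the flusher's wake-up state; not the server's actual class.
class FlusherKick {
public:
    // Called by writers right after a commit. Returns true if this caller was the
    // one that took the mutex and signalled the flusher.
    bool requestFlush() {
        // Only the thread that flips the flag from false to true pays for the mutex.
        if (_pending.exchange(true)) {
            return false;  // A kick is already pending; the next pass covers our write.
        }
        {
            // Taking the mutex before notifying avoids a lost wake-up if the flusher
            // has checked the flag but has not started waiting yet.
            std::lock_guard<std::mutex> lk(_mutex);
            ++_writerLockCount;
        }
        _cv.notify_one();
        return true;
    }

    // Called by the single flusher thread at the top of each loop iteration. The flag
    // is cleared before the flush starts, so commits that arrive during the flush
    // trigger another pass.
    void waitForKick() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _pending.load(); });
        const bool wasPending = _pending.exchange(false);
        assert(wasPending);  // Holds because only the flusher thread clears the flag.
        (void)wasPending;
    }

    // Number of times a writer acquired the mutex (for illustration only).
    int writerLockCount() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _writerLockCount;
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _cv;
    std::atomic<bool> _pending{false};
    int _writerLockCount = 0;
};
```

The sketch uses the default sequentially consistent ordering for simplicity. Whether a weaker ordering is safe, and whether secondaries should kick at all, would need to be decided against the real code.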
Note that for most user-initiated writes (which tend to be replicated), the awaitReplication logic does a better job of waiting for journaling than the logic that waits for durability:
- The durability wait first waits for there to be no oplog holes. However, it does this by checking the cache maintained by the OplogVisibilityThread rather than asking WT directly, and because waiters don't tap the cv (either intentionally or as an oversight), that thread will wait up to 1ms or until the next `WUOW::commit()`, which is a real problem for single-threaded writers.
- The no-holes wait is only done when there is a single voter in the replica set. Because of the oplogTruncatePoint, we really should be doing it for all replica sets. However, because we also check this in awaitReplication, we are ok.
- Both the no-holes wait and the wait for journaling wait for the next pass of their respective threads after "now" rather than for the first pass after the optime of the operation. This is obviously wrong if the write happens to be durable already by the time we reach the wait, but it is also problematic because it will wait for any other ops that happen to come in in parallel before we reach that point. And if there is a single voter, the "now" point used for waiting for journaling is after we have waited for no holes, so it is likely to include even more unrelated writes.
- When we need to wait for replication, it is likely (but not guaranteed) that the replication will complete after the local journaling. This means that our thread may wait and wake up 2 or 3 times in the process of waiting for write concern. If we instead centralize all waiting in awaitReplication (which perhaps should be renamed), we will only need to wait and wake once, which reduces needless context switches.
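To make the contrast with "wait for the next pass" concrete, here is a minimal sketch of a single centralized wait keyed on the operation's optime. `DurabilityTracker`, `updateDurable()`, `isSatisfied()` and `await()` are hypothetical names, `OpTime` is simplified to a plain integer, and counting any node (not necessarily the current one) toward the requirement is an assumption, since the j:1 semantics are still an open question above. The waiter returns immediately if the target is already durable on enough nodes and otherwise blocks in one place only.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

using OpTime = std::uint64_t;  // Hypothetical simplification of a real optime.

// Tracks the durable point of every node, including the local one, and lets a
// writer wait for a specific optime instead of for the next flusher pass.
class DurabilityTracker {
public:
    explicit DurabilityTracker(std::size_t numNodes) : _durable(numNodes, 0) {
        assert(numNodes > 0);
    }

    // Record that `node` has journaled everything up to `opTime`. Durable points
    // never move backwards.
    void updateDurable(std::size_t node, OpTime opTime) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _durable.at(node) = std::max(_durable.at(node), opTime);
        }
        _cv.notify_all();
    }

    // True if at least `numNodesNeeded` nodes have `target` durable.
    bool isSatisfied(OpTime target, std::size_t numNodesNeeded) const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _countDurable(target) >= numNodesNeeded;
    }

    // Block until the requirement is met. Returns at once if it already is, and is
    // not delayed by unrelated later writes.
    void await(OpTime target, std::size_t numNodesNeeded) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return _countDurable(target) >= numNodesNeeded; });
    }

private:
    // Caller must hold _mutex.
    std::size_t _countDurable(OpTime target) const {
        return static_cast<std::size_t>(
            std::count_if(_durable.begin(), _durable.end(),
                          [target](OpTime d) { return d >= target; }));
    }

    mutable std::mutex _mutex;
    std::condition_variable _cv;
    std::vector<OpTime> _durable;
};
```

With three nodes, a call such as `await(opTime, 2)` would model w:majority, j:1 and could be satisfied by the two secondaries alone. Preserving today's "durable on the current node" behavior would need an extra check on the local node's entry.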