Currently, the oplog read timestamp is set by the same asynchronous mechanism regardless of replication state (PRIMARY or SECONDARY): a background thread loops, taking note of the optime of the latest oplog entry that has no holes before it (every earlier entry is already committed), waiting for the journal, and then publishing that optime as the new oplog read timestamp.
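As a rough illustration of the loop described above, here is a toy model of a single iteration. It is only a sketch: timestamps are plain integers, and `OplogModel`, `reserve`, `commit`, `no_holes_point`, and `wait_for_journal` are hypothetical names invented for this example, not the server's actual classes or functions. It also assumes that a "hole" is a timestamp that has been reserved but not yet committed.

```python
from dataclasses import dataclass, field


@dataclass
class OplogModel:
    """Toy model of oplog visibility; not the server's real data structures."""
    committed: set = field(default_factory=set)   # timestamps whose writes have committed
    in_flight: set = field(default_factory=set)   # reserved but uncommitted timestamps (holes)
    journaled_through: int = 0                    # highest timestamp known to be journaled
    oplog_read_timestamp: int = 0                 # what oplog readers are allowed to see

    def reserve(self, ts: int) -> None:
        self.in_flight.add(ts)

    def commit(self, ts: int) -> None:
        self.in_flight.remove(ts)
        self.committed.add(ts)

    def no_holes_point(self) -> int:
        """Largest committed timestamp that has no hole before it (0 if none)."""
        first_hole = min(self.in_flight, default=None)
        visible = [ts for ts in self.committed
                   if first_hole is None or ts < first_hole]
        return max(visible, default=0)

    def wait_for_journal(self, through: int) -> None:
        # Stand-in for blocking until the journal has been flushed.
        self.journaled_through = max(self.journaled_through, through)

    def visibility_loop_iteration(self) -> int:
        """One pass of the loop: note the optime, wait for journal, publish."""
        candidate = self.no_holes_point()
        self.wait_for_journal(candidate)
        self.oplog_read_timestamp = max(self.oplog_read_timestamp, candidate)
        return self.oplog_read_timestamp
```

In this model, if timestamps 1, 2, and 3 are reserved and only 1 and 3 have committed, one iteration publishes 1; once 2 commits, the next iteration publishes 3.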
This algorithm is correct for primary nodes. On secondary nodes, however, the wait for journaling is unnecessary: a secondary can never read holes after an unclean shutdown, because the last applied time is stored durably. Today, the stable timestamp (and the oldest timestamp) can race ahead of the oplog read timestamp on secondaries. By forgoing the wait for journaling on secondaries, we can set the oplog read timestamp in lockstep with the stable and oldest timestamps, which avoids the race.
The work for this ticket is to change the oplog read timestamp loop so that it operates only while a node is in primary mode; in secondary mode, new code in the applier loop will set the oplog read timestamp whenever the last applied time is set.
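The proposed split could look roughly like the following toy model. This is a sketch under stated assumptions, not the planned implementation: `MemberState`, `Node`, `visibility_loop_iteration`, and `finish_applier_batch` are hypothetical names, timestamps are plain integers, and the journal wait is reduced to a counter so that the difference between the two paths is visible.

```python
from dataclasses import dataclass
from enum import Enum


class MemberState(Enum):
    PRIMARY = "PRIMARY"
    SECONDARY = "SECONDARY"


@dataclass
class Node:
    """Toy replica set member; all field and method names are illustrative only."""
    state: MemberState
    last_applied: int = 0
    oplog_read_timestamp: int = 0
    journal_waits: int = 0  # counts how often we would have blocked on the journal

    def visibility_loop_iteration(self, no_holes_point: int) -> bool:
        """Asynchronous loop body; in the proposal it only acts in primary mode."""
        if self.state is not MemberState.PRIMARY:
            return False  # skipped: the applier owns the read timestamp on secondaries
        self.journal_waits += 1  # stand-in for waiting for the journal
        self.oplog_read_timestamp = max(self.oplog_read_timestamp, no_holes_point)
        return True

    def finish_applier_batch(self, batch_last_optime: int) -> None:
        """Secondary path: publish the read timestamp together with last applied."""
        if self.state is not MemberState.SECONDARY:
            raise RuntimeError("this model only applies batches in secondary mode")
        # No journal wait here. Both values move in the same step, so anything
        # derived from last applied cannot get ahead of the oplog read timestamp.
        self.last_applied = batch_last_optime
        self.oplog_read_timestamp = batch_last_optime
```

For example, on a `Node(MemberState.SECONDARY)`, calling `visibility_loop_iteration(5)` does nothing, while `finish_applier_batch(7)` leaves both `last_applied` and `oplog_read_timestamp` at 7 without incrementing `journal_waits`.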