[SERVER-24499] Write optimizations for linearizable reads Created: 09/Jun/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Hari Devaraj Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: PM-173
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-24497 implement noop writes to test for pri... Closed
Assigned Teams:
Replication
Participants:

 Description   

Look for writes sent to the primary since the read operation was sent to the replica set. If such a write exists, check whether it has been propagated to a majority of the nodes in the replica set. If so, the current primary is the true primary. This should reduce the number of no-ops we use, because we are checking for a user-submitted write that would have had to be done anyway.

At the beginning of a linearizable read, the server shall record the current last set timestamp (LST). This is the optime of the last write across the entire mongod instance that has already been returned to a client. Technically, this is the optime assigned to the “last” document written to the oplog, but because of multiple writers and the concurrency logic that produces the illusion of a monotonically increasing optime, that document may not necessarily be visible in the oplog just yet. This value, which we will call the original LST, will be used later on to determine whether any writes have completed and committed while the read was being processed.

(Optimization 1) When the server finishes the read, it observes the commit level (the optime of the last committed operation). If the commit level is greater than the original LST, a write that completed after the client issued the linearizable read has now been committed by a majority of nodes, which proves that the server was still a valid primary at the time the read began. This confirms that the read can be linearizable, so the server returns the data.

(Optimization 2) If the commit level is not greater than the original LST, the server then observes the current last set timestamp. If the current LST is greater than the original LST, at least one write is currently replicating and may soon advance the commit level. The server blocks until either the condition in Optimization 1 is met (the commit level exceeds the original LST) or maxTimeMS is reached (timeout).

If the current LST is the same as the original LST, no writes have occurred that could prove primaryship. The server shall write an ‘n’ (no-op) entry to the oplog, and then block until either the no-op has been replicated to a majority of nodes or a timeout occurs. This part has been implemented in SERVER-24497.
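
The decision made when the read finishes can be sketched as a small pure function. This is only an illustration of the three cases described above, not server code: the names OpTime, Action, and after_read are hypothetical, and the optime is modeled as a simplified (term, timestamp) pair that compares field by field.

```python
from enum import Enum
from typing import NamedTuple


class OpTime(NamedTuple):
    # Hypothetical, simplified stand-in for a replication optime.
    # NamedTuple instances compare field by field: term first, then timestamp.
    term: int
    timestamp: int


class Action(Enum):
    RETURN_DATA = "return data"                # Optimization 1
    WAIT_FOR_COMMIT = "wait for commit level"  # Optimization 2
    WRITE_NOOP = "write no-op and wait"        # fallback (SERVER-24497)


def after_read(original_lst: OpTime, commit_level: OpTime, current_lst: OpTime) -> Action:
    """Pick the next step once the linearizable read has finished executing."""
    # Optimization 1: a write newer than the original LST is already
    # majority-committed, so primaryship is confirmed without extra work.
    if commit_level > original_lst:
        return Action.RETURN_DATA
    # Optimization 2: a newer write exists but is not yet committed;
    # block until the commit level passes the original LST or maxTimeMS expires.
    if current_lst > original_lst:
        return Action.WAIT_FOR_COMMIT
    # No write has happened since the read began: fall back to a no-op write.
    return Action.WRITE_NOOP
```

For example, with an original LST of OpTime(1, 100), a commit level of OpTime(1, 99), and a current LST of OpTime(1, 101), the function selects WAIT_FOR_COMMIT. The actual blocking and timeout handling are left out of the sketch.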

