  1. Core Server
  2. SERVER-24499

Write optimizations for linearizable reads

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication

      Look for writes sent to the primary after the read operation was sent to the replica set. If such a write exists, check whether it has been propagated to a majority of the nodes. If so, the current primary is the true primary. This should reduce the number of no-ops we write, because we rely on a user-submitted write that would have had to be done anyway.

      At the beginning of a linearizable read, the server shall note the current last set timestamp (LST). This is the optime of the last write across the entire mongod instance that has already been returned to a client. Technically, it is the optime assigned to the “last” document written to the oplog, but because multiple writers and concurrency logic cooperate to produce the illusion of a monotonically increasing optime, that document may not be visible in the oplog yet. This value, which we will call the original LST, is used later to determine whether any writes have completed and committed while the read was being processed.
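
A minimal sketch of taking the original LST snapshot, under loud assumptions: optimes are modeled as plain integers, and `OptimeAllocator`, `assign_next`, and `last_set` are hypothetical names invented for illustration, not the server's actual code.

```python
import threading


class OptimeAllocator:
    """Hypothetical stand-in for optime assignment; not MongoDB code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_set = 0  # simplified optime: a plain integer

    def assign_next(self) -> int:
        """Assign the next optime to a write; values only ever increase."""
        with self._lock:
            self._last_set += 1
            return self._last_set

    def last_set(self) -> int:
        """Return the most recently assigned optime (the LST)."""
        with self._lock:
            return self._last_set


allocator = OptimeAllocator()
allocator.assign_next()              # a write that precedes the read
original_lst = allocator.last_set()  # snapshot taken when the read begins
allocator.assign_next()              # a write that arrives during the read
wrote_during_read = allocator.last_set() > original_lst
```

The lock only serializes assignment in this toy model; it says nothing about when the corresponding oplog entry becomes visible.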

      (Optimization 1) When the server finishes the read, it observes the commit level (the optime of the last committed operation). If the commit level is greater than the original LST, a write that completed after the client issued the linearizable read has now been committed by a majority of nodes, which proves that the server was still a valid primary when the read began. This confirms that the read is linearizable, so the server returns the data.

      (Optimization 2) If the commit level is not greater than the original LST, the server then observes the current LST. If the current LST is greater than the original LST, at least one write is currently replicating and may soon advance the commit level. The server blocks until either the condition in Optimization 1 is met or maxTimeMS expires (timeout).
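
The blocking wait in Optimization 2 could be sketched with a condition variable, as below. This is an illustration only: `CommitTracker`, `advance`, and `wait_until_past` are hypothetical names, optimes are plain integers, and `max_time_ms` merely stands in for maxTimeMS.

```python
import threading


class CommitTracker:
    """Hypothetical tracker for the commit level; not MongoDB code."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._commit_level = 0

    def advance(self, optime: int) -> None:
        """Record that a majority has acknowledged writes up to `optime`."""
        with self._cond:
            if optime > self._commit_level:
                self._commit_level = optime
                self._cond.notify_all()

    def wait_until_past(self, original_lst: int, max_time_ms: int) -> bool:
        """Block until commit level > original_lst; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._commit_level > original_lst,
                timeout=max_time_ms / 1000.0,
            )


tracker = CommitTracker()
# Simulate replication committing optime 6 shortly after the read starts waiting.
threading.Timer(0.05, tracker.advance, args=(6,)).start()
committed_in_time = tracker.wait_until_past(original_lst=5, max_time_ms=2000)
# Nothing ever commits past optime 10 here, so this wait times out.
timed_out = not tracker.wait_until_past(original_lst=10, max_time_ms=50)
```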

      If the current LST equals the original LST, no writes have occurred that could prove the server is still primary. The server shall write a no-op (‘n’ op) to the oplog and then block until either the no-op is replicated to a majority of nodes or a timeout occurs. This part has been implemented in SERVER-24497.
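
Taken together, the three cases described above reduce to a small decision function. The sketch below assumes optimes can be compared as plain integers; `NextStep` and `next_step` are hypothetical names used only for illustration.

```python
from enum import Enum


class NextStep(Enum):
    RETURN_DATA = "return data"
    WAIT_FOR_COMMIT = "wait for commit"
    WRITE_NOOP_AND_WAIT = "write no-op and wait"


def next_step(original_lst: int, current_lst: int, commit_level: int) -> NextStep:
    """Decide what a finished linearizable read should do next."""
    # Optimization 1: a write newer than the snapshot is already majority-committed.
    if commit_level > original_lst:
        return NextStep.RETURN_DATA
    # Optimization 2: a newer write exists but has not been committed yet.
    if current_lst > original_lst:
        return NextStep.WAIT_FOR_COMMIT
    # Fallback: no write has occurred since the snapshot, so a no-op is needed.
    return NextStep.WRITE_NOOP_AND_WAIT


step_committed = next_step(original_lst=5, current_lst=7, commit_level=6)
step_in_flight = next_step(original_lst=5, current_lst=7, commit_level=5)
step_idle = next_step(original_lst=5, current_lst=5, commit_level=5)
```

Only the last branch adds a write to the oplog; the first two rely on user writes that were happening anyway, which is the saving this ticket is after.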

            Assignee: backlog-server-repl ([DO NOT USE] Backlog - Replication Team)
            Reporter: hari.devaraj (Hari Devaraj)