  1. Core Server
  2. SERVER-46308

Investigate dependency between commit point (lastCommitted) and cluster time

    • Fully Compatible
    • ALL
    • Sharding 2020-04-20, Sharding 2020-05-04, Sharding 2020-05-18, Sharding 2020-07-13, Sharding 2020-06-01, Sharding 2020-06-15, Sharding 2020-06-29, Sharding 2020-07-27, Sharding 2020-08-10, Sharding 2020-08-24, Sharding 2020-09-21, Sharding 2020-10-05, Sharding 2020-10-19, Sharding 2020-11-02, Sharding 2020-11-16, Sharding 2020-11-30, Sharding 2020-12-14, Sharding 2020-12-28, Sharding 2021-01-11, Sharding 2021-01-25, Sharding 2021-02-22, Sharding 2021-03-08, Sharding 2021-03-22, Sharding 2021-04-05, Sharding 2021-04-19, Sharding 2021-05-03

      computeOperationTime returns the replication lastCommittedOpTime for majority reads and the lastAppliedOpTime otherwise. A dassert checks that neither of them is ever beyond the cluster time.

      We have a dassert in ReplicationCoordinatorImpl::_setMyLastAppliedOpTimeAndWallTime to assert that the lastAppliedOpTime should never advance beyond the cluster time. However, we do not have a corresponding dassert for the lastCommittedOpTime.

      Normally, for internal communications between replset members, we have the LogicalTimeMetadataHook to parse cluster time metadata and advance the cluster time on receiving a network message. For example, when a node receives a heartbeat response, it parses the cluster time metadata as part of the network interface hooks before handing it off to repl to process the heartbeat response. And when the node processes the heartbeat, it could advance the commit point on hearing a more recent commit point. So the assumption that the commit point is never ahead of the cluster time is normally correct because we parse the cluster time metadata first.

      However, if a heartbeat response comes from an arbiter, it could contain a more recent commit point without cluster time metadata, simply because the logical clock is disabled for arbiters. So theoretically, the following could happen:
      1. The secondary's current knowledge of the cluster time is 90.
      2. The secondary receives a heartbeat response without cluster time metadata from an arbiter that has a commit point of 100.
      3. The secondary processes the heartbeat and advances its commit point (lastCommitted) to 100.
      4. A majority read on the secondary returns operation time 100 from computeOperationTime.
      5. The dassert is hit because the secondary's logical clock is at 90, which is less than the operation time of 100.
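
      The sequence above can be sketched as a toy model. This is not server code: ToyNode, its fields, and its methods are hypothetical names, plain integers stand in for timestamps, and the unconditional commit point advancement is a simplification of what repl actually does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Toy model of a secondary; integers stand in for timestamps (assumption)."""

    cluster_time: int
    last_committed: int = 0

    def on_heartbeat_response(self, commit_point: int,
                              cluster_time_metadata: Optional[int]) -> None:
        # The network hook runs first: it advances the cluster time, but only
        # when the response actually carries cluster time metadata.
        if cluster_time_metadata is not None:
            self.cluster_time = max(self.cluster_time, cluster_time_metadata)
        # Repl then processes the heartbeat and may advance the commit point.
        self.last_committed = max(self.last_committed, commit_point)

    def majority_operation_time(self) -> int:
        # Models computeOperationTime returning lastCommitted for majority reads.
        return self.last_committed

    def invariant_holds(self) -> bool:
        # The condition the dassert checks: operation time <= cluster time.
        return self.majority_operation_time() <= self.cluster_time


# Heartbeat from a data-bearing member: metadata arrives with the commit point.
normal = ToyNode(cluster_time=90)
normal.on_heartbeat_response(commit_point=100, cluster_time_metadata=100)

# Heartbeat from an arbiter: commit point 100, no cluster time metadata.
arbiter_case = ToyNode(cluster_time=90)
arbiter_case.on_heartbeat_response(commit_point=100, cluster_time_metadata=None)
```

      In this model the invariant holds for the normal heartbeat and fails for the arbiter heartbeat, which corresponds to step 5.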

      It seems odd that computeOperationTime returns lastCommitted for majority reads, because majority reads don't actually read at lastCommitted but at the committed snapshot. Repl guarantees that the committed snapshot is never ahead of lastApplied, whereas lastCommitted could be. So I think it is more correct for computeOperationTime to return the committed snapshot for majority reads.
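
      To illustrate why the committed snapshot would be a safer return value, here is a minimal sketch. It assumes, for illustration only, that the committed snapshot is the newest locally applied optime that is not past lastCommitted; the function name and the integer optimes are hypothetical, and the real selection logic in repl may differ.

```python
from bisect import bisect_right
from typing import List, Optional


def committed_snapshot(applied_optimes: List[int],
                       last_committed: int) -> Optional[int]:
    """Return the newest applied optime <= last_committed, or None if none exists.

    applied_optimes must be sorted in ascending order.
    """
    idx = bisect_right(applied_optimes, last_committed)
    return applied_optimes[idx - 1] if idx > 0 else None


# The node has applied up to 90, but has heard of a commit point of 100.
applied = [70, 80, 90]
last_applied = applied[-1]
snapshot = committed_snapshot(applied, last_committed=100)
```

      Under this assumption the snapshot can never exceed lastApplied. Since the existing dassert already keeps lastApplied at or below the cluster time, the snapshot would stay at or below the cluster time as well, even when lastCommitted does not.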

      This ticket should investigate whether the scenario described above could actually happen with arbiters. No matter which time we decide computeOperationTime should return for majority reads, we should add a corresponding dassert in repl, like the one we have for lastApplied, to enforce that assumption. We should also have a targeted test for it.
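
      A rough sketch of what such a check and a targeted test could look like, again as a toy model rather than server code. ToyReplCoordinator, set_majority_read_time, and trips_check are hypothetical names. Python's assert stands in for dassert here, and like dassert in non-debug builds it is skipped when Python runs with -O.

```python
class ToyReplCoordinator:
    """Toy stand-in for the repl component that tracks the majority read time."""

    def __init__(self, cluster_time: int) -> None:
        self.cluster_time = cluster_time
        self.majority_read_time = 0

    def set_majority_read_time(self, new_time: int) -> None:
        # Stand-in for the proposed dassert: whichever time is returned for
        # majority reads must never advance beyond the cluster time.
        assert new_time <= self.cluster_time, (
            f"majority read time {new_time} is ahead of "
            f"cluster time {self.cluster_time}"
        )
        self.majority_read_time = max(self.majority_read_time, new_time)


def trips_check(cluster_time: int, new_time: int) -> bool:
    """Return True if advancing to new_time trips the check."""
    coord = ToyReplCoordinator(cluster_time)
    try:
        coord.set_majority_read_time(new_time)
    except AssertionError:
        return True
    return False
```

      A targeted test would then drive the arbiter scenario, for example trips_check(90, 100), and confirm either that the check fires or, once the underlying issue is fixed, that the state can no longer be reached.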

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            lingzhi.deng@mongodb.com Lingzhi Deng
            Votes:
            0
            Watchers:
            8
