Core Server / SERVER-79785

WaitForMajorityService can be a bottleneck for two phase commit transactions

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Sharding NYC
    • ALL
      Steps to reproduce:
      1. Start up the shard setup in DSI using this binary
      2. scp the mongo_updateone locust workload to the workload client and set it up
        1. Or use any workload with concurrent two phase commit transactions
      3. Disable the custom "canUseSingleWriteCommit" server parameter on each mongos to force the non-targeted single writes from the workload to use two phase commit
      4. Disable all new server parameters from the binary to test baseline 2PC performance, or enable any of them to see performance without the bottleneck (by default, txnMajorityWaitInReplCoordinator is enabled, which uses the async awaitReplication fix)

      While performance testing for SERVER-79056, I noticed that throughput for transactions using two phase commit scales poorly with more concurrent transactions, even though CPU and IO utilization stay low and secondaries keep up. The problem seems to be that the WaitForMajorityService, which two phase commit coordinators use to wait for the participant list and decision writes to majority replicate, can't keep up with many concurrent requests to wait for majority.

      When I switched transaction coordinators to either wait for majority write concern as part of the writes themselves (which synchronously blocks a task executor thread) or wait asynchronously using ReplicationCoordinator::awaitReplicationAsyncNoWTimeout, throughput on the same workload went up significantly (over 4x in my setup) and CPU became the bottleneck. I initially saw this in the shard DSI workload with a custom 0.3 ms network delay, which uses 3-node replica sets, but I reproduced it in a modified shard workload with single-node replica sets.
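To illustrate the asynchronous alternative: instead of a coordinator thread blocking until its opTime is majority committed, the replication coordinator can hand back a future that a single progress-tracking thread fulfills later. The sketch below is a toy model, not the server's implementation; FakeReplCoord and its methods are hypothetical stand-ins (the real entry point named in this ticket is ReplicationCoordinator::awaitReplicationAsyncNoWTimeout), and OpTime is reduced to an integer.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

// Toy stand-in for an OpTime: just a monotonically increasing counter.
using OpTime = long long;

// Hypothetical model of an async replication wait. Callers get a future
// immediately and never block a task executor thread; one replication
// progress update can fulfill many outstanding waits at once.
class FakeReplCoord {
public:
    // Models awaitReplicationAsyncNoWTimeout: returns a future that becomes
    // ready once the majority-committed opTime reaches `t`.
    std::future<void> awaitReplicationAsync(OpTime t) {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.emplace_back(t, std::promise<void>());
        auto fut = _waiters.back().second.get_future();
        _maybeFulfill();
        return fut;
    }

    // Called when replication progress advances the majority commit point.
    void advanceCommitted(OpTime t) {
        std::lock_guard<std::mutex> lk(_mutex);
        _committed = t;
        _maybeFulfill();
    }

private:
    // Caller holds _mutex. Resolves every waiter at or below the commit point.
    void _maybeFulfill() {
        auto it = _waiters.begin();
        while (it != _waiters.end()) {
            if (it->first <= _committed) {
                it->second.set_value();
                it = _waiters.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::mutex _mutex;
    OpTime _committed = -1;
    std::vector<std::pair<OpTime, std::promise<void>>> _waiters;
};
```

The key property is that the caller's thread is released immediately; a coordinator can attach a continuation to the future rather than parking an executor thread per transaction, which is why throughput stopped being limited by the wait path.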

      The problem with the WaitForMajorityService seems to be that it waits on only the lowest opTime it has been given in each iteration of _periodicallyWaitForMajority(), so if it receives new opTimes faster than it can wait on them, requests queue up and latency increases significantly. I modified the service to fetch the latest committed snapshot opTime after each majority wait (using ReplicationCoordinator::getCurrentCommittedSnapshotOpTime) and treat that as the most recently waited-for opTime whenever it is greater than the opTime actually waited on, and that also resolved the bottleneck.
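The queuing behavior described above can be modeled in a few lines. The sketch below is a simplified simulation, not the server code: WaiterQueue, drainOnce, and iterationsToDrain are hypothetical names, and the lambda stands in for the blocking wait plus the subsequent getCurrentCommittedSnapshotOpTime read. With one-opTime-per-iteration draining, N queued waiters need N loop iterations even when the commit point is already past all of them; batching by the committed snapshot drains them in one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Toy stand-in for an OpTime: just an integer.
using OpTime = std::int64_t;

// Minimal model of the WaitForMajorityService waiter queue. Real waiters
// hold promises; here we only count resolved requests per loop iteration.
struct WaiterQueue {
    std::multimap<OpTime, int> waiters;  // opTime -> request id, sorted ascending

    void add(OpTime t, int id) { waiters.emplace(t, id); }

    // One iteration of the (modeled) _periodicallyWaitForMajority loop.
    // `waitForMajority` models the blocking wait on the lowest opTime and
    // returns the committed snapshot opTime observed afterwards. With
    // batching enabled, every waiter at or below that snapshot is resolved
    // in this pass instead of only those at the single lowest opTime.
    template <typename WaitFn>
    int drainOnce(WaitFn waitForMajority, bool batchBySnapshot) {
        if (waiters.empty()) return 0;
        OpTime lowest = waiters.begin()->first;
        OpTime committed = waitForMajority(lowest);
        OpTime satisfiedUpTo =
            batchBySnapshot ? std::max(lowest, committed) : lowest;
        int resolved = 0;
        auto it = waiters.begin();
        while (it != waiters.end() && it->first <= satisfiedUpTo) {
            it = waiters.erase(it);
            ++resolved;
        }
        return resolved;
    }
};

// How many loop iterations does it take to drain 100 queued opTimes when
// the majority commit point is already ahead of all of them?
int iterationsToDrain(bool batchBySnapshot) {
    WaiterQueue q;
    for (int i = 0; i < 100; ++i) q.add(/*opTime=*/i, /*id=*/i);
    // Model a replica set whose committed snapshot (1000) is already past
    // every queued opTime by the time the wait returns.
    auto waitForMajority = [](OpTime) -> OpTime { return 1000; };
    int iterations = 0;
    while (q.drainOnce(waitForMajority, batchBySnapshot) > 0) ++iterations;
    return iterations;
}
```

Under this model, `iterationsToDrain(false)` takes 100 iterations while `iterationsToDrain(true)` takes 1, which is the essence of why reading the committed snapshot after each wait removes the backlog.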

        1. 2PC_locust_results.txt
          7 kB
          Jack Mulrow
        2. mongo_updateone.tar.gz
          11 kB
          Jack Mulrow

            Assignee:
            backlog-server-sharding-nyc [DO NOT USE] Backlog - Sharding NYC
            Reporter:
            jack.mulrow@mongodb.com Jack Mulrow
            Votes:
            0
            Watchers:
            7
