Documentation / DOCS-15476

[C2C] [Server] Investigate changes in REP-959: worse latencies and more CPU usage on the destination replica set after the sync is complete

      Original Downstream Change Summary

We should document that, when setting up the destination cluster, users should set minSnapshotHistoryWindowSeconds=0 to avoid worse latency on the destination after the sync completes.
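If this recommendation is adopted, the parameter can be set at startup via the configuration file. A minimal, illustrative mongod.conf fragment for the destination replica set members (all other settings omitted) might look like:

```yaml
# Illustrative fragment only. Setting the snapshot history window to 0
# minimizes how long WiredTiger retains snapshot history during the sync.
setParameter:
  minSnapshotHistoryWindowSeconds: 0
```

The same parameter can also be changed at runtime with db.adminCommand({ setParameter: 1, minSnapshotHistoryWindowSeconds: 0 }) and restored to its default (300 seconds) once the migration is complete.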

      Description of Linked Ticket

      Problem Statement/Rationale

      What is going wrong? What action would you like the Engineering team to take?

After the sync is complete, the same baseline workload executed on the destination replica set shows a 26% latency increase. Average user CPU also increases by 26% and CPU IOWait by 17%, compared with the same baseline workload executed on the source replica set.

      Steps to Reproduce

      How could an engineer replicate the issue you’re reporting?

This is a 3-node replica set mongosync test with a 100 GB dataset.

      Expected Results

      What do you expect to happen?

Similar latencies and CPU/storage usage on the source and destination replica sets.

      Actual Results

      What do you observe is happening?

After the sync is complete, the same baseline workload on the destination replica set shows significantly worse latency and resource usage.

      Additional Notes

      Any additional information that may be useful to include.

Even if the query metrics are exactly the same, the underlying data distribution of the WiredTiger tables appears to be different. For the same number of documents accessed, the numbers of blocks, cache pages, and bytes accessed differ:

      • WiredTiger reads 71% more blocks and writes 16% more blocks to disk or filesystem cache.
      • WiredTiger reads 63% more pages and writes 39% more pages from disk or filesystem cache to WiredTiger cache.
      • WiredTiger reads 20% more bytes and writes 5.5% more bytes.
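The percentage deltas above are ratios of cumulative WiredTiger counters (for example, the block-manager "blocks read" and "blocks written" counters exposed by db.serverStatus()) sampled on each cluster over the same workload. A minimal sketch of that comparison, using made-up counter values purely for illustration:

```python
def pct_increase(source: float, destination: float) -> float:
    """Percent increase of a destination counter relative to the source counter."""
    return (destination - source) / source * 100.0

# Hypothetical cumulative counters sampled over the same baseline workload
# (illustrative numbers only, not taken from the actual test).
source_counters = {"blocks read": 1_000_000, "blocks written": 500_000}
dest_counters = {"blocks read": 1_710_000, "blocks written": 580_000}

for name in source_counters:
    delta = pct_increase(source_counters[name], dest_counters[name])
    print(f"{name}: {delta:+.0f}%")
```

With these illustrative inputs the script reports +71% for blocks read and +16% for blocks written, matching the shape of the deltas listed above.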

This could be caused by the difference between how the data was originally created (with the Genny loader) and how it was synced, ending up with different physical layouts on disk. This significantly affects the time needed to perform checkpoints, as well as the latencies.

      Let me know if you need me to attach any extra information.

            Assignee:
            Alison Huh (alison.huh@mongodb.com)
            Reporter:
            Backlog - Core Eng Program Management Team (backlog-server-pm)
            Votes:
            0
            Watchers:
            5

              Resolved:
              22 weeks, 5 days ago