[DOCS-15476] [C2C] [Server] Investigate changes in REP-959: worse latencies and more CPU usage on the destination replica set after the sync is complete Created: 08/Jul/22  Updated: 08/Jan/24  Resolved: 18/Dec/23

Status: Closed
Project: Documentation
Component/s: C2C, Server
Affects Version/s: None
Fix Version/s: Server_Docs_[20240108]

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Alison Huh
Resolution: Won't Do Votes: 0
Labels: query, replication, server-docs-bug-bash, storage-engines
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
Related
Participants:
Last reply: 7 weeks, 2 days ago
Epic Link: DOCSP-22764

 Description   
Original Downstream Change Summary

We should document that when setting up the destination cluster, users should set minSnapshotHistoryWindowSeconds=0 to avoid the issue of worse latency on the destination after the sync completes.
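As a sketch of what the documented guidance could look like, the parameter can be set at runtime with an admin command in mongosh, or at startup. (This is illustrative only; the exact placement of the guidance in the docs is up to the assignee.)

```javascript
// Runtime: run against the destination cluster in mongosh.
db.adminCommand( { setParameter: 1, minSnapshotHistoryWindowSeconds: 0 } )
```

Alternatively, at startup: `mongod --setParameter minSnapshotHistoryWindowSeconds=0`, or the equivalent `setParameter` entry in the mongod configuration file.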

Description of Linked Ticket

Problem Statement/Rationale

What is going wrong? What action would you like the Engineering team to take?

After the sync is complete, executing the same baseline workload on the destination replica set shows a 26% latency increase. Average user CPU also increases by 26%, and CPU IOWait by 17%, compared with the same baseline workload executed on the source replica set.

Steps to Reproduce

How could an engineer replicate the issue you’re reporting?

This is a three-node replica set mongosync test with a 100 GB dataset.

Expected Results

What do you expect to happen?

Similar latencies and CPU/storage usage on the source and destination replica sets.

Actual Results

What do you observe is happening?

After the sync is complete, the same baseline workload on the destination replica set shows significantly worse latency and CPU usage.

Additional Notes

Any additional information that may be useful to include.

Even though the query metrics are exactly the same, the underlying data distribution of the WiredTiger table appears to be different. For the same number of documents accessed, the blocks, cache pages, and bytes accessed differ:

  • WiredTiger reads 71% more blocks and writes 16% more blocks to disk or filesystem cache.
  • WiredTiger reads 63% more pages and writes 39% more pages from disk or filesystem cache to WiredTiger cache.
  • WiredTiger reads 20% more bytes and writes 5.5% more bytes.

This could be caused by a difference between how the data was originally created (with the Genny loader) and how it was synced, resulting in different physical layouts on disk. This significantly affects checkpoint duration and latencies.
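The block, page, and byte counters above can be pulled on each node for comparison. A minimal mongosh sketch, assuming the standard serverStatus WiredTiger section (field names may vary by server version):

```javascript
// Sketch: dump the WiredTiger block-manager and cache counters that the
// comparison above is based on. Run against a source node and a destination
// node, then diff the outputs.
const wt = db.serverStatus().wiredTiger;
printjson({
  blocksRead:            wt["block-manager"]["blocks read"],
  blocksWritten:         wt["block-manager"]["blocks written"],
  pagesReadIntoCache:    wt.cache["pages read into cache"],
  pagesWrittenFromCache: wt.cache["pages written from cache"],
  bytesReadIntoCache:    wt.cache["bytes read into cache"],
  bytesWrittenFromCache: wt.cache["bytes written from cache"]
});
```

Per-collection WiredTiger statistics are also available via `db.collection.stats()` if the comparison needs to be narrowed to the synced table.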

Let me know if you need me to attach any extra information.



 Comments   
Comment by Alison Huh [ 18/Dec/23 ]

Closing this ticket since the team mentions that further investigation in REP-3631 won't be done for the foreseeable future. Feel free to re-open this ticket or create a new one if REP-3631 is ever addressed.

Generated at Thu Feb 08 08:13:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.