[SERVER-47909] Fix mongos_large_catalog_workloads NetworkInterfaceExceededTimeLimit error Created: 02/May/20  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.0
Fix Version/s: None

Type: Task Priority: Minor - P4
Reporter: Lamont Nelson Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related:
is related to SERVER-60578 [4.4] Use higher timeouts for config ... Closed
Assigned Teams:
Cluster Scalability
Sprint: Service arch 2020-05-18
Participants:
Linked BF Score: 0

 Description   

In this workload, we see latency spikes into the ~20s range for catalog metadata operations on the config server, and the operations occasionally time out at 30s.

Adjust the server parameters and/or the test so that per-operation latency is < 1s. Possible approaches:

1. Turn off flow control to see whether it has any effect, or
2. Reduce the batch size in the workload to limit per-operation latency on the config server, or
3. Reduce the concurrency to limit per-operation latency on the config server,

or some combination of these.
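For option 1, flow control can be toggled at runtime via the `enableFlowControl` server parameter (available since 4.2), so no rebuild or restart is needed. A minimal sketch of the experiment against the config server primary; the batch-size and concurrency knobs in options 2 and 3 are assumed to be workload parameters rather than server parameters:

```javascript
// Run in mongosh against the config server replica set primary.

// Disable flow control to see whether the latency spikes disappear.
db.adminCommand({ setParameter: 1, enableFlowControl: false });

// Inspect whether flow control had been throttling writers:
// look at isLagged and timeAcquiringMicros in the flowControl section.
db.serverStatus().flowControl;

// Re-enable once the experiment is done.
db.adminCommand({ setParameter: 1, enableFlowControl: true });
```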



 Comments   
Comment by Lamont Nelson [ 04/May/20 ]

I don't believe this is a new issue or related to changes from the RSM. The 20-30s latency is measured internally on a single config server and doesn't include any network communication. There's evidence in the BF that it could be flow control, which I was hoping to either confirm or rule out with this ticket. SERVER-45880 is an existing flow control ticket that may explain what is happening if that is the case.

The portion of the test causing the latency spikes is the "setup" phase, where we insert up to 1M new collections into the catalog from 32 threads so that we can test reads and writes against this large amount of metadata.

I checked much older builds from the beginning of March, where the build is green, and we see latency spikes in the 10s of seconds rather than the 20-30s we see now. So the workload was showing early signs of eventually hitting the 30s timeout, and this single node has definitely slowed over time. I think we should see a similar performance impact on a single-node replica set test with similar batching/document-size parameters. But as it stands, we aren't able to measure performance with large catalogs due to this error in the test setup.
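For context, the setup phase described above amounts to something like the following (a hypothetical sketch, not the actual workload code; the host placeholder, database name, and collection prefix are illustrative, and the real workload drives this from 32 client threads rather than one loop):

```javascript
// Hypothetical sketch of the setup phase: create a large number of
// collections to inflate the catalog. Single-threaded for clarity;
// the real workload uses 32 concurrent threads.
const conn = Mongo("mongodb://<mongos-host>:27017"); // illustrative host
const testDb = conn.getDB("large_catalog");          // illustrative name
for (let i = 0; i < 1000000; i++) {
  // Each insert into a new namespace implicitly creates a collection,
  // growing the catalog metadata the later read/write phases exercise.
  testDb.getCollection("coll_" + i).insertOne({ _id: i });
}
```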

Generated at Thu Feb 08 05:15:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.