[DOCS-15433] [SERVER] Secondary slowdown or hang due to pinned content Created: 21/Jun/22  Updated: 13/Nov/23  Resolved: 18/Aug/22

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 5.0.0-rc0, 4.2.16, 4.0.27, 4.4.9, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Dave Cuthbert (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-34938 Secondary slowdown or hang due to con... Closed
Participants:
Days since reply: 1 year, 24 weeks, 6 days ago
Epic Link: DOCSP-15042

 Description   
Original Downstream Change Summary

Document that maxNumberOfThreads is now capped by the number of available cores.

Description of Linked Ticket

We only advance the oldest timestamp at oplog batch boundaries. This means that all dirty content generated by applying the operations in a single batch will be pinned in cache. If the batch is large enough and the operations are heavy enough, this dirty content can exceed eviction_dirty_trigger (by default 20% of the cache), and the rate of applying operations becomes dramatically slower because the applier has to wait for the dirty data to be reduced below the threshold.
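
To make the threshold concrete, here is a minimal standalone sketch of the check described above; the function name, the byte values, and the simplified percentage math are illustrative assumptions, not WiredTiger's actual eviction code:

    // Illustration only: approximate form of the eviction_dirty_trigger check
    // (default trigger is 20% of total cache size).
    #include <cstdint>
    #include <iostream>

    bool exceedsDirtyTrigger(std::uint64_t dirtyBytes,
                             std::uint64_t cacheSizeBytes,
                             double dirtyTriggerPct = 20.0) {
        // Dirty content pinned by an in-flight oplog batch counts toward this total.
        return dirtyBytes > cacheSizeBytes * (dirtyTriggerPct / 100.0);
    }

    int main() {
        // Example: 1.2 GB of pinned dirty data in a 5 GB cache exceeds the
        // 20% trigger (1 GB), so further operations stall until dirty data
        // is evicted back below the threshold.
        std::uint64_t cacheBytes = 5ull * 1024 * 1024 * 1024;
        std::uint64_t dirtyBytes = 1200ull * 1024 * 1024;
        std::cout << std::boolalpha << exceedsDirtyTrigger(dirtyBytes, cacheBytes) << "\n";
    }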

This can be triggered by a momentary slowdown on a secondary that causes it to lag briefly: the next batch it processes will be unusually large, causing it to exceed 20% dirty cache. That makes it lag even further, so the next batch is larger still, and so on. In extreme cases the node can become completely stuck because the full cache prevents a batch from completing and unpinning the data that is keeping the cache full.

This can also occur if a secondary is offline for maintenance; when it comes back online and begins to catch up, it will be processing large batches that risk exceeding the dirty trigger threshold, so it may apply operations at a much slower rate than a secondary that is keeping up and processing operations in small batches.



 Comments   
Comment by Githook User [ 18/Aug/22 ]

Author: Dave Cuthbert (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: DOCS-15433 BACKPORT (#1665)
Branch: v4.4
https://github.com/10gen/docs-mongodb-internal/commit/fe29393d63de67778a155f8dfb1ce07249fe5695

Comment by Githook User [ 18/Aug/22 ]

Author: Dave Cuthbert (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: DOCS-15433 BACKPORT (#1664)
Branch: v5.0
https://github.com/10gen/docs-mongodb-internal/commit/8b7bbbdbbcf26df89da63a7347a77bb78689f731

Comment by Githook User [ 17/Aug/22 ]

Author: Dave Cuthbert (davemungo) <69165704+davemungo@users.noreply.github.com>

Message: Docs 15433 secondary slowdown v6.0 (#1569)

  • Review feedback

Comment by Moustafa Maher [ 21/Jun/22 ]

The relevant code is here:

    // Cap the applier thread count at twice the number of available cores.
    auto numberOfThreads =
        std::min(replWriterThreadCount, 2 * static_cast<int>(ProcessInfo::getNumAvailableCores()));

And then we set the minimum thread count accordingly:

    // minThreads can never exceed the (possibly capped) thread count.
    options.minThreads = replWriterMinThreadCount < threadCount ? replWriterMinThreadCount : threadCount;
    options.maxThreads = static_cast<size_t>(threadCount);

So this "capped" part should be documented here https://www.mongodb.com/docs/manual/reference/parameters/#mongodb-parameter-param.replWriterThreadCount?
