[SERVER-37849] Poor replication performance and cache-full hang on secondary due to pinned content Created: 31/Oct/18 Updated: 21/Jan/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Backlog - Storage Engines Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Engines |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
This is a follow-on to
|
| Comments |
| Comment by Bruce Lucas (Inactive) [ 07/Nov/18 ] |
|
I've attached a new reproducer, repro-10MBx8-push-fast.sh. It removes the numactl invocation that restricted mongod to a single core, and it reduces the number of update operations so that the script can be run several times in quick succession. The script waits for the updates to be replicated and exits once they have been, so it can be run in a loop until it hangs. In my experience, any given run of the script has a fairly high likelihood of hanging. |
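The "run in a loop until it hangs" approach can be sketched as a small driver. This is only an illustration, not part of the attached reproducer: the function name `run_until_hang`, the timeout value, and the run limit are all assumptions, and a run is treated as hung simply because it did not exit within the timeout.

```python
import subprocess


def run_until_hang(cmd, timeout_s, max_runs):
    """Run cmd repeatedly and report the first run that fails to exit in time.

    Returns the 1-based number of the run that exceeded timeout_s seconds,
    or None if all max_runs runs exited on their own.
    """
    for run in range(1, max_runs + 1):
        try:
            # The reproducer exits once replication has caught up, so a run
            # that is still going after timeout_s is treated as a hang.
            subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            return run
    return None


# Hypothetical usage; the 600-second timeout is an assumed value:
#   hung_on = run_until_hang(["bash", "repro-10MBx8-push-fast.sh"], 600, 20)
```

Note that `subprocess.run` kills only the direct child process when the timeout expires. Any processes the script started itself would need to be inspected or cleaned up separately.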
| Comment by Bruce Lucas (Inactive) [ 07/Nov/18 ] |
|
I've attached another repro script that updates 8 documents instead of 2. This creates more parallelism on the secondary, so it may create more opportunity for hangs if race conditions are involved. Here's a hang using this script:
It seems unusual that it is hung with reported dirty content of essentially 0 (0.026 MB, 1 page). The total (non-dirty) cache content is at 327%, so maybe that's why the threads are waiting, but why isn't that content being evicted so they can proceed?
FTDC data is also attached as hang8.zip.
All 8 oplog applier threads (one for each of the 8 documents) are stuck in the same place, waiting for the cache:
|
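For reference, cache-fill and dirty figures like those quoted above can be derived from the `wiredTiger.cache` section of `serverStatus`. The following is a minimal sketch of that arithmetic. The helper name `cache_fill` is an assumption, and the sample values in the usage comment are invented for illustration and are not taken from hang8.zip.

```python
def cache_fill(cache):
    """Summarize a serverStatus()["wiredTiger"]["cache"] document.

    Returns total and dirty cache content as a percentage of the configured
    maximum, plus dirty content in MB.
    """
    max_bytes = cache["maximum bytes configured"]
    used_bytes = cache["bytes currently in the cache"]
    dirty_bytes = cache["tracked dirty bytes in the cache"]
    return {
        "used_pct": 100.0 * used_bytes / max_bytes,
        "dirty_pct": 100.0 * dirty_bytes / max_bytes,
        "dirty_mb": dirty_bytes / (1024 * 1024),
    }


# Illustrative input only (made-up numbers):
#   cache_fill({"maximum bytes configured": 1000,
#               "bytes currently in the cache": 3270,
#               "tracked dirty bytes in the cache": 0})
# With pymongo, the input document would typically come from
#   client.admin.command("serverStatus")["wiredTiger"]["cache"]
```

A total fill well above 100% combined with near-zero dirty content, as in the hang described here, would show up as a large `used_pct` next to a `dirty_pct` close to 0.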