[SERVER-31704] Periodic drops in throughput during checkpoints while waiting on schema lock Created: 24/Oct/17 Updated: 17/Aug/23 Resolved: 24/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.4.9 |
| Fix Version/s: | 4.3 Desired |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kelsey Schubert | Assignee: | Brian Lane |
| Resolution: | Duplicate | Votes: | 13 |
| Labels: | customer-mgmt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | diagnostic.tar.gz |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | Run mix.js |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Periodic drops in throughput can be observed during checkpoints while running an artificial workload with specific WiredTiger tuning parameters enabled. I've attached diagnostic.tar.gz.
After reaching steady state, we see periodic throughput drops coinciding with checkpoints. Additionally, from perf we can see that after making this parameter change, a single CPU core (12.5% of the total) is consistently fully utilized running eviction.
|
| Comments |
| Comment by Eric Milkie [ 15/Aug/20 ] |
|
|
| Comment by Chad Kreimendahl [ 14/Aug/20 ] |
|
Will |
| Comment by Ralf Strobel [ 16/Jun/20 ] |
|
|
| Comment by Andy Pang [ 10/Jun/20 ] |
|
I don't see the macro in question listed here. Here are our patches: https://github.com/apang-ns/mongo/commit/1aea123a49f6eed873694a8d5dd9a46f903a3dde and https://github.com/apang-ns/mongo/commit/b23b9b1a0a079e9835a3e16a91ddaaaaacc83014 Percona has a better version of this patch, included in their 3.6 release train, if you're looking to experiment: https://www.percona.com/doc/percona-server-for-mongodb/3.6/release_notes/3.6.18-6.0.html As mentioned in this thread, |
| Comment by Firass Almiski [ 09/Jun/20 ] |
|
So which of these options did you tweak, Andy? https://source.wiredtiger.com/mongodb-3.4/group__wt.html#ga8ca567b2908997280e4f0a20b80b358b
I'd like to give that a try. I'm seeing issues that I believe are related to dhandle as well. |
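(For reference, the link above is WiredTiger's wiredtiger_open() configuration documentation. Below is a minimal, hypothetical sketch of how such tuning options are passed as a configuration string when opening WiredTiger directly; the thread does not say which option was actually changed, so the cache_size and eviction thread counts here are illustrative only.)

```c
/*
 * Hypothetical sketch: opening a standalone WiredTiger connection with a
 * tuning configuration string. The specific options shown (cache_size,
 * eviction thread counts, statistics) are illustrative, not the ones
 * tuned in this ticket.
 */
#include <stdio.h>
#include <stdlib.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    int ret;

    /* Options are a comma-separated string; see the wiredtiger_open docs. */
    ret = wiredtiger_open("/data/wt_home", NULL,
        "create,cache_size=1GB,eviction=(threads_min=4,threads_max=4),statistics=(fast)",
        &conn);
    if (ret != 0) {
        fprintf(stderr, "wiredtiger_open failed: %s\n", wiredtiger_strerror(ret));
        return (EXIT_FAILURE);
    }

    /* ... open sessions, run the workload ... */

    return (conn->close(conn, NULL) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
}
```

(In a mongod deployment, an equivalent engine configuration string would normally be supplied via storage.wiredTiger.engineConfig.configString rather than by calling wiredtiger_open() directly.)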
| Comment by Andy Pang [ 09/Jun/20 ] |
|
For workloads with a high data handle count, we've found that WT spends 50% of checkpoint prepare time traversing dhandles, and during this time requests from app threads are not serviced. Our fix was to increase the hash array size for dhandles (currently hardcoded in WT to 512 buckets) for more efficient lookup. We were able to improve performance by 50%+ during checkpoint prepare (the time when the schema lock is held) by eliminating the traversal time using the existing hash mechanism. |
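(To make the mechanism concrete, here is a minimal, hypothetical illustration of the idea described in the comment above, not WiredTiger's actual code. Data handles are chained per hash bucket keyed by table URI, so average lookup cost is roughly the number of open handles divided by the bucket count; enlarging a small hardcoded bucket array shortens the chains that must be walked while the schema lock is held during checkpoint prepare. The names dhandle_t, conn_t, DHANDLE_HASH_SIZE, and uri_hash are illustrative.)

```c
/*
 * Illustrative sketch only -- simplified stand-ins for WiredTiger's
 * internal dhandle hash. Lookup cost scales with
 * (open handles / DHANDLE_HASH_SIZE), so with hundreds of thousands of
 * collections and only 512 buckets, each per-bucket chain is long.
 */
#include <stdint.h>
#include <string.h>

#define DHANDLE_HASH_SIZE 512        /* hypothetical: raise to e.g. 32768 */

typedef struct dhandle {
    const char *uri;                 /* e.g. "table:mydb/collection-12-..." */
    struct dhandle *hash_next;       /* chain within one bucket */
} dhandle_t;

typedef struct conn {
    dhandle_t *dh_hash[DHANDLE_HASH_SIZE];
} conn_t;

/* FNV-1a string hash, mapped onto the bucket array. */
static uint64_t
uri_hash(const char *uri)
{
    uint64_t h = 14695981039346656037ULL;
    for (const char *p = uri; *p != '\0'; ++p)
        h = (h ^ (uint8_t)*p) * 1099511628211ULL;
    return h % DHANDLE_HASH_SIZE;
}

/* Lookup walks only one bucket's chain instead of every open handle. */
static dhandle_t *
dhandle_find(conn_t *conn, const char *uri)
{
    dhandle_t *dh;

    for (dh = conn->dh_hash[uri_hash(uri)]; dh != NULL; dh = dh->hash_next)
        if (strcmp(dh->uri, uri) == 0)
            return dh;
    return NULL;
}
```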
| Comment by Brian Lane [ 24/Mar/20 ] |
|
Hi falmiski@creativeradicals.com, There currently isn't an ETA for this one - I had wanted to spend some time digging into it before the upcoming 4.4 release, but I expect it will need to wait until after that release is finished. Could you give me more details on which version you are using and the issue you are experiencing? |
| Comment by Firass Almiski [ 23/Mar/20 ] |
|
Requesting update, once again |
| Comment by Firass Almiski [ 18/Mar/20 ] |
|
ETA of this fix? Please bump the priority if at all possible... |
| Comment by Brian Lane [ 18/Feb/20 ] |
|
dmitry.agranat could you perhaps re-run this with a version that has |
| Comment by Kelsey Schubert [ 01/Nov/17 ] |
|
I don't believe this is new as I see similar behavior in MongoDB 3.4.4. Unfortunately, given the nature of the workload, I wasn't able to determine whether 3.4.0 is similarly affected. |
| Comment by Alexander Gorrod [ 24/Oct/17 ] |
|
anonymous.user Do you know if this symptom is new, or if it was present in previous releases of MongoDB? |
| Comment by Kelsey Schubert [ 24/Oct/17 ] |
|
alex.komyagin encountered this issue during some testing around tuning workloads with many collections. |