[SERVER-18079] Large performance drop with documents > 16k on Windows Created: 16/Apr/15 Updated: 29/Oct/15 Resolved: 04/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.1 |
| Fix Version/s: | 3.0.4, 3.1.3 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Mark Benvenuto |
| Resolution: | Done | Votes: | 0 |
| Labels: | FT |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | Windows |
| Backport Completed: |
| Sprint: | Platform 3 05/15/15 |
| Participants: |
| Description |
|
The Windows Heap is implemented in user-space as part of kernel32.dll. It divides the heap into two portions:
When WiredTiger needs to make many allocations >= 16KB, performance on Windows will suffer once the WT cache fills up. This is easily triggered by workloads with large documents.
Original Description
Interestingly, the drop is much more pronounced with a 2-node replica set than it is with a 1-node replica set (write concern 1 in both cases). The stack traces confirm the contention and show why a 2-node replica set is worse: During the second half of the run there are significant periods when all 10 threads are waiting for the oplog deleter thread to reduce the size of the oplog:
The oplog deleter thread is held up by significant contention in the allocator, in 3 different places:
A primary source of contention seems to be the getmore that is tailing the oplog, which explains why a 2-node replica set is worse. Note that the oplog tail getmore is having to page in the entries even though they are brand new, which explains why it generates contention, but that seems non-optimal in itself, independent of the allocator issues.
Repro takes about 3 minutes (Windows on VMware instance, 6 cpus, 12 GB memory). Initialize 2-node replica set with these options:
Then run 10 threads of the following, which runs twice as long as it takes for the oplog to fill in order to generate a good comparison.
|
| Comments |
| Comment by Githook User [ 03/Jun/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: |
| Comment by Githook User [ 03/Jun/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: (cherry picked from commit 0a2be248888a1c11fda2848682b54fd314ab162c) |
| Comment by Githook User [ 03/Jun/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: (cherry picked from commit 263a5d5b39091a87a553420dba5fb393902a2166) |
| Comment by Githook User [ 03/Jun/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: (cherry picked from commit b43f9663dd08d3f585a9ae9aa67476b5e1a9d07c) |
| Comment by Chad Kreimendahl [ 14/May/15 ] |
|
Is there no performance benefit or value in using TCMalloc for mongo's heap, too? |
| Comment by Githook User [ 05/May/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: |
| Comment by Githook User [ 04/May/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: |
| Comment by Githook User [ 04/May/15 ] |
|
Author: Mark Benvenuto (markbenvenuto, mark.benvenuto@mongodb.com) Message: |
| Comment by Mark Benvenuto [ 24/Apr/15 ] |
|
I have run 3 different benchmarks to understand the performance implications of this change: Two-instance replica set, 5 GB WT cache, 6 GB oplog size, same machine (with 32 GB RAM), Windows 8.1 with each replica on separate SSDs.
|
| Comment by Mark Benvenuto [ 21/Apr/15 ] |
|
In answer to earlier inquiries about how document size affects performance, I ran the above repro with 1 KB, 2 KB, 4 KB, and 16 KB documents. The 1 KB documents show little degradation over time. As document size goes up, the performance degradation increases; at about 4 KB, we see poor performance similar to the 16 KB scenario.
|
| Comment by Mark Benvenuto [ 21/Apr/15 ] |
|
There is one other option, which eitan.klein suggested: use separate Windows heaps for MongoDB and WiredTiger code. This approach provides a little benefit, as we spend less time fighting over the single lock that guards large allocations:
This does not provide enough benefit to pursue, and I still prefer one of the earlier TCMalloc approaches for WT allocations. |
| Comment by Mark Benvenuto [ 20/Apr/15 ] |
Problem Description
The Windows Heap is implemented in user-space as part of kernel32.dll. It divides the heap into two portions:
WiredTiger does many allocations >= 16KB in Bruce's workload above, which causes performance on Windows to suffer once the WT cache fills up. Below is a graph of allocations bucketed by powers of 2, from 2 bytes to 1MB, during the workload above. These are only the allocations done by WT, not those made by the rest of MongoDB. We can see there is a non-zero number of allocations.
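The bucketing described above can be sketched in C as follows. This is only an illustration of counting allocation sizes in power-of-two buckets from 2 bytes to 1 MB; the names `alloc_hist`, `alloc_bucket`, `alloc_hist_record`, and `alloc_hist_large` are hypothetical and are not part of WiredTiger or of the instrumentation actually used to produce the graph.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buckets cover 2^1 (2 bytes) through 2^20 (1 MB). */
#define MIN_SHIFT 1
#define MAX_SHIFT 20
#define NUM_BUCKETS (MAX_SHIFT - MIN_SHIFT + 1)
#define LARGE_SHIFT 14 /* 2^14 = 16 KB, the threshold discussed in this ticket */

/* Hypothetical histogram type: one counter per power-of-two bucket. */
typedef struct {
    uint64_t count[NUM_BUCKETS];
} alloc_hist;

/*
 * Return the bucket index for an allocation size. Bucket i holds sizes in
 * [2^(i + MIN_SHIFT), 2^(i + MIN_SHIFT + 1)). Sizes below 2 bytes fall into
 * the first bucket and sizes above 1 MB are clamped into the last one.
 */
static size_t alloc_bucket(size_t size)
{
    size_t shift = MIN_SHIFT;

    while (shift < MAX_SHIFT && (size >> (shift + 1)) != 0)
        shift++;
    return shift - MIN_SHIFT;
}

/* Record one allocation of the given size. */
static void alloc_hist_record(alloc_hist *h, size_t size)
{
    assert(h != NULL);
    h->count[alloc_bucket(size)]++;
}

/* Count the recorded allocations of at least 16 KB. */
static uint64_t alloc_hist_large(const alloc_hist *h)
{
    uint64_t total = 0;
    size_t i;

    assert(h != NULL);
    for (i = LARGE_SHIFT - MIN_SHIFT; i < NUM_BUCKETS; i++)
        total += h->count[i];
    return total;
}
```

Calling `alloc_hist_record` from an allocation wrapper and then reading `alloc_hist_large` would give the number of requests that land in the >= 16KB range.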
Possible Solutions
There are two possible solutions:
Considering that we already vendor TCMalloc, and use it on Linux, this is the logical heap to use instead of the Windows Heap. There are several different ways to include TCMalloc:
1. WT calls tc_malloc
Since WT abstracts all allocation calls into os_alloc.c, we could change WT to call tc_malloc instead of malloc. With this choice, allocations made by MongoDB code would continue to use the Windows heap, but allocations made by WT would use TCMalloc. This would mean that the process would have two heaps: a Windows heap and a TCMalloc heap.
2. TCMalloc runtime patches malloc
Since Windows does not provide an easy way to override the malloc function as POSIX libc does, the TCMalloc library finds the addresses of all the various malloc-related functions during startup and replaces them with calls to TCMalloc at runtime. This option means we have only one heap. The disadvantage is that the TCMalloc disassembler depends on the version of the compiler we use.
3. MongoD is statically linked with a libcmt.lib which has no malloc
Neither Chromium nor Firefox uses the Windows Heap for memory allocations on Windows. Instead, they use TCMalloc and JEMalloc respectively. Both use different tricks to statically link in their own mallocs instead of the CRT's. The Chromium method is to create a modified libcmt.lib with malloc removed. The Firefox method is to rely on link order to include jemalloc's malloc, and to modify an object file binary using a Python script that does a string replace. I do not know if there are any license implications for these two approaches, though.
Summary
My preference is for either option 1 or option 3. There is still work to do to evaluate the functional and performance effects of these choices. CC: acm |
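As a rough illustration of option 1, the sketch below routes a storage engine's allocations through one small wrapper layer that can be pointed at TCMalloc at build time. It is not the actual WiredTiger os_alloc.c code or the change that was committed. The `wt_sketch_*` names and the `WT_USE_TCMALLOC` flag are hypothetical, and the header path `gperftools/tcmalloc.h` is an assumption about how the vendored library is laid out. Without the flag the wrappers fall back to the C runtime allocator, so the sketch builds on its own.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * WT_USE_TCMALLOC is a hypothetical build flag for this sketch. When it is
 * defined, the wrappers call the tc_* entry points exported by gperftools
 * TCMalloc; otherwise they use the C runtime heap.
 */
#ifdef WT_USE_TCMALLOC
#include <gperftools/tcmalloc.h> /* assumed header location */
#define RAW_MALLOC tc_malloc
#define RAW_CALLOC tc_calloc
#define RAW_REALLOC tc_realloc
#define RAW_FREE tc_free
#else
#define RAW_MALLOC malloc
#define RAW_CALLOC calloc
#define RAW_REALLOC realloc
#define RAW_FREE free
#endif

/* Allocate size bytes from the heap selected at build time. */
void *wt_sketch_malloc(size_t size)
{
    return RAW_MALLOC(size);
}

/* Allocate a zeroed array of number elements of the given size. */
void *wt_sketch_calloc(size_t number, size_t size)
{
    return RAW_CALLOC(number, size);
}

/* Resize a block previously returned by one of these wrappers. */
void *wt_sketch_realloc(void *p, size_t size)
{
    return RAW_REALLOC(p, size);
}

/* Release a block previously returned by one of these wrappers. */
void wt_sketch_free(void *p)
{
    RAW_FREE(p);
}
```

With two heaps in one process, a block must be released by the same heap that allocated it, so every pointer obtained from these wrappers has to go back through `wt_sketch_free` rather than the CRT `free`.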