[SERVER-18079] Large performance drop with documents > 16k on Windows Created: 16/Apr/15  Updated: 29/Oct/15  Resolved: 04/May/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.1
Fix Version/s: 3.0.4, 3.1.3

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: FT
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File DualWinHeap.png     PNG File WinHeapBigDocs.png     PNG File WinHeapVariousDocSizes.png     PNG File drop.png     PNG File stacks-deleter.png     PNG File stacks-oplog-tail.png     PNG File stacks-waiters.png     PNG File sysbench.png     PNG File wt_mem_alloc.png     PNG File ycsb_1k.png     PNG File ycsb_4k.png    
Issue Links:
Depends
is depended on by WT-1904 Poor allocator performance on Windows... Closed
Related
related to SERVER-18375 High CPU during insert-only stress te... Open
related to SERVER-18081 Tailing the oplog requires paging in ... Closed
is related to SERVER-17495 Stand alone mongod throughput dropped... Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Backport Completed:
Sprint: Platform 3 05/15/15
Participants:

 Description   

The Windows Heap is implemented in user-space as part of kernel32.dll. It divides the heap into two portions:

  • the Low-Fragmentation Heap (LFH) which is optimized for allocations < 16 KB
  • the large allocation portion which takes a CriticalSection for each allocation and free.

When WiredTiger makes many allocations >= 16 KB, performance on Windows suffers once the WT cache fills up. This is easily triggered by workloads with large documents.
The solution we're pursuing is to use tc_malloc in place of the system allocator for WT on Windows.
See below for more details and the original bug report.
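The size-based split described above can be sketched as follows. This is illustrative only: the threshold comes from the description above, and the real dispatch logic is internal to the Windows Heap implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the Windows Heap dispatch described above (illustrative;
 * the actual logic is internal to kernel32.dll): allocations under
 * 16 KB are served by the Low-Fragmentation Heap, while larger ones
 * take the CriticalSection-protected large-allocation path. */
#define LFH_THRESHOLD (16 * 1024)

typedef enum { ALLOC_LFH, ALLOC_LARGE_LOCKED } alloc_path;

static alloc_path classify_alloc(size_t bytes) {
    return bytes < LFH_THRESHOLD ? ALLOC_LFH : ALLOC_LARGE_LOCKED;
}
```

A 16,000-byte document payload plus BSON overhead crosses this threshold, which is why the repro below uses documents of that size.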

Original Description

  • At D performance drops and remains low for the second half of the run (first row).
  • This coincides with the oplog reaching capacity (second row).
  • It also coincides with an increase in heap lock contention (third row), presumably related to the increased memory allocator activity caused by deleting entries from the oplog. Note that heap lock contention has about doubled in spite of the operation rate having fallen by about 4x.
  • A couple of other momentary dips in performance (A, B, C) appear to be related to disk activity (last two rows), and in fact heap lock contention fell during those dips because of the reduced op rate.

Interestingly, the drop is much more pronounced with a 2-node replica set than it is with a 1-node replica set (write concern 1 in both cases). The stack traces confirm the contention and show why a 2-node replica set is worse:

During the second half of the run there are significant periods when all 10 threads are waiting for the oplog deleter thread to reduce the size of the oplog:

The oplog deleter thread is held up by significant contention in the allocator, in 3 different places:

A primary source of contention seems to be the getmore that is tailing the oplog, which explains why a 2-node replica set is worse. Note that the oplog tail getmore is having to page in the entries even though they are brand new, which explains why it generates contention, but that seems non-optimal in itself, independent of the allocator issues.

Repro takes about 3 minutes (Windows on VMware instance, 6 cpus, 12 GB memory). Initialize 2-node replica set with these options:

    mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 5 --oplogSize 6000 ...

Then run 10 threads of the following, which runs twice as long as it takes for the oplog to fill in order to generate a good comparison.

size = 16000   // document payload size in bytes (> 16 KB per document with BSON overhead)
w = 1          // write concern
batch = 10     // inserts per bulk operation
 
function insert() {
    // build a 'size'-byte string to use as the document payload
    x = ''
    for (var i = 0; i < size; i++)
        x = x + 'x'
 
    finish = 0
    start = t = ISODate().getTime()
    for (var i = 0; finish == 0 || t < finish; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j = 0; j < batch; j++, i++)
            bulk.insert({x: x})
        bulk.execute({w: w});
        t = ISODate().getTime()
        if (i > 0 && i % 1000 == 0) {
            var s = db.getSiblingDB('local').oplog.rs.stats(1024 * 1024)
            print(i, s.size)
            // once the oplog is ~90% full, run for as long again so the
            // second half of the run exercises oplog capping
            if (finish == 0 && s.size > s.maxSize * 0.9) {
                finish = start + 2 * (t - start)
                print('finishing in', finish - t, 'ms')
            }
        }
    }
}



 Comments   
Comment by Githook User [ 03/Jun/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows
Branch: v3.0
https://github.com/mongodb/mongo/commit/7e62741800a6e458ab02b9da91e5c37d389c8eae

Comment by Githook User [ 03/Jun/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows

(cherry picked from commit 0a2be248888a1c11fda2848682b54fd314ab162c)
Branch: v3.0
https://github.com/mongodb/mongo/commit/b677e49bed78c415498102a6d7d1cfbed43e76f7

Comment by Githook User [ 03/Jun/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows

(cherry picked from commit 263a5d5b39091a87a553420dba5fb393902a2166)
Branch: v3.0
https://github.com/mongodb/mongo/commit/7662238b18a136fbad09e45163adff20d32fd1be

Comment by Githook User [ 03/Jun/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows

(cherry picked from commit b43f9663dd08d3f585a9ae9aa67476b5e1a9d07c)
Branch: v3.0
https://github.com/mongodb/mongo/commit/99767cdffb16203c5f3190560614b1a166c16bb3

Comment by Chad Kreimendahl [ 14/May/15 ]

Is there no performance benefit or value in using TCMalloc for mongo's heap, too?

Comment by Githook User [ 05/May/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows
Branch: master
https://github.com/mongodb/mongo/commit/0a2be248888a1c11fda2848682b54fd314ab162c

Comment by Githook User [ 04/May/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows
Branch: master
https://github.com/mongodb/mongo/commit/263a5d5b39091a87a553420dba5fb393902a2166

Comment by Githook User [ 04/May/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-18079: Use TCMalloc for WT on Windows
Branch: master
https://github.com/mongodb/mongo/commit/b43f9663dd08d3f585a9ae9aa67476b5e1a9d07c

Comment by Mark Benvenuto [ 24/Apr/15 ]

I have run 3 different benchmarks to understand the performance implications of this change:

Two instance replica set, 5 GB WT cache, 6 GB op log size, same machine (with 32 GB RAM), Windows 8.1 with each replica on separate SSDs.

  1. YCSB 4 KB documents. This workload is I/O bound on my machine; the spikes on Windows are caused by write stalls, which allow better read rates.
  2. YCSB 1 KB documents: TCMalloc is a clear win.
  3. SysBench: TCMalloc is a clear win in this case also.

Comment by Mark Benvenuto [ 21/Apr/15 ]

In answer to earlier inquiries about how document size affects performance, I ran the above repro with 1 KB, 2 KB, 4 KB, and 16 KB documents. The 1 KB documents show little degradation over time. As document size goes up, the degradation increases; at about 4 KB we see performance as poor as in the 16 KB scenario.

Comment by Mark Benvenuto [ 21/Apr/15 ]

There is one other option, suggested by eitan.klein: use separate Windows heaps for the MongoDB and WiredTiger code.

This approach provides a little benefit as we spend less time fighting over the single lock that guards large allocations:

This does not provide enough benefit to pursue, and I still prefer one of the earlier TCMalloc approaches for WT allocations.

Comment by Mark Benvenuto [ 20/Apr/15 ]

Problem Description

The Windows Heap is implemented in user-space as part of kernel32.dll. It divides the heap into two portions:

  1. the Low-Fragmentation Heap (LFH) which is optimized for allocations < 16 KB
  2. the large allocation portion which takes a CriticalSection for each allocation and free.

WiredTiger makes many allocations >= 16 KB in Bruce's workload above, which causes performance on Windows to suffer once the WT cache fills up.

Below is a graph of allocations during the workload above, bucketed by powers of 2 from 2 bytes to 1 MB. These are only the allocations done by WT, not the rest of MongoDB. We can see a substantial number of allocations at or above the 16 KB LFH cutoff.
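A power-of-2 bucketing function like the one behind that histogram might look like the following. This is a sketch, not WiredTiger's actual instrumentation.

```c
#include <assert.h>
#include <stddef.h>

/* Round an allocation size up to its power-of-2 bucket, with buckets
 * ranging from 2 bytes to 1 MB as in the histogram above. Illustrative
 * sketch only; not WiredTiger's actual instrumentation. */
static size_t alloc_bucket(size_t bytes) {
    size_t bucket = 2;
    while (bucket < bytes && bucket < ((size_t)1 << 20))
        bucket <<= 1;
    return bucket; /* sizes past 1 MB all land in the top bucket */
}
```

Note that the repro's 16,000-byte payloads land in the 16 KB bucket, at the boundary where the Windows Heap stops using the LFH.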

Possible Solutions

There are two possible solutions:

  1. Do nothing
  2. Change the heap we use for MongoDB and/or WiredTiger

Considering that we already vendor TCMalloc, and use it on Linux, this is the logical heap to use instead of the Windows Heap. There are several different ways to include TCMalloc:

  1. WT calls tc_malloc functions
  2. TCMalloc runtime patches malloc
  3. MongoD is statically linked with a libcmt.lib which has no malloc

1. WT calls tc_malloc

Since WT abstracts all allocation calls into os_alloc.c, we could change WT to call tc_malloc instead of malloc. In this choice, allocations made by MongoDB code would continue to use the Windows Heap, but allocations made by WT would use TCMalloc. This would mean that the process would have two heaps: a Windows heap and a TCMalloc heap.
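Option 1 could be sketched as a compile-time switch inside WT's allocation wrapper. The names below (the macro names, the guard, and wt_calloc_sketch) are hypothetical; the real change lives in WT's os_alloc.c, and tc_malloc/tc_free are the gperftools entry points.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of option 1: route WT's allocation wrapper to tc_malloc /
 * tc_free when built against TCMalloc, leaving the rest of the process
 * on the system allocator. Names are illustrative, not WT's actual
 * symbols. */
#ifdef HAVE_TCMALLOC
extern void *tc_malloc(size_t size);
extern void tc_free(void *ptr);
#define WT_RAW_MALLOC(n) tc_malloc(n)
#define WT_RAW_FREE(p)   tc_free(p)
#else
#define WT_RAW_MALLOC(n) malloc(n)
#define WT_RAW_FREE(p)   free(p)
#endif

/* A zeroing allocator built on the raw hooks, in the style of WT's
 * os_alloc.c wrappers. */
static void *wt_calloc_sketch(size_t number, size_t size) {
    void *p = WT_RAW_MALLOC(number * size);
    if (p != NULL)
        memset(p, 0, number * size);
    return p;
}
```

Because every WT allocation already funnels through one file, only the raw hooks need to change; callers elsewhere in WT are untouched.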

2. TCMalloc runtime patches malloc

Since Windows does not provide an easy way to override the malloc function (unlike POSIX libc), the TCMalloc library finds the addresses of the various malloc-related functions during startup and replaces them with calls to TCMalloc at runtime. This option means we have only one heap. The disadvantage is that the TCMalloc disassembler depends on the version of the compiler we use.

3. MongoD is statically linked with a libcmt.lib which has no malloc

Neither Chromium nor Firefox uses the Windows Heap for memory allocation on Windows; they use TCMalloc and jemalloc, respectively. Both use tricks to statically link their own malloc in place of the CRT's. The Chromium method is to create a modified libcmt.lib with malloc removed. The Firefox method is to rely on link order to pick up jemalloc's malloc, using a Python script to do a binary string replace on an object file.

I do not know whether either of these two approaches has license implications.

Summary

Option  Heap      Method
1       Both      tc_ calls
2       TCMalloc  Runtime patching
3       TCMalloc  Static link

My preference is for either option 1 or option 3. There is still work to do to evaluate the functional and performance effects of these choices.

CC: acm

Generated at Thu Feb 08 03:46:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.