[SERVER-19522] Capped collection insert rate declines over time under WiredTiger Created: 22/Jul/15  Updated: 24/Aug/15  Resolved: 17/Aug/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.0.6, 3.1.7

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 19522.png     HTML File capped-delete.html     PNG File capped-delete.png     PNG File patch50.png    
Issue Links:
Depends
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
is depended on by SERVER-19744 WiredTiger changes for MongoDB 3.0.6 Closed
Related
is related to SERVER-19995 Performance drop-off when capped coll... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:

 Description   

Capped collection insert rate begins to decline steadily once the collection is full and deletions start. Restarting mongod (without disturbing the collection) restores inserts to the original rate, but the rate then declines again, falling to a fraction of the initial rate:

20486 mb, 72577 ins/s
20489 mb, 63780 ins/s
20494 mb, 61213 ins/s
20495 mb, 55644 ins/s
20491 mb, 50556 ins/s
20494 mb, 46906 ins/s
20493 mb, 39926 ins/s
20496 mb, 36380 ins/s
20496 mb, 34873 ins/s
20496 mb, 43359 ins/s
20492 mb, 36704 ins/s
20491 mb, 25022 ins/s
20493 mb, 26334 ins/s
20494 mb, 27420 ins/s
20496 mb, 26594 ins/s
20493 mb, 25970 ins/s
20496 mb, 27202 ins/s
20496 mb, 25868 ins/s
20492 mb, 26571 ins/s
20491 mb, 26460 ins/s
20496 mb, 26840 ins/s
20496 mb, 26273 ins/s
20491 mb, 25883 ins/s
20496 mb, 25342 ins/s
20491 mb, 25477 ins/s
20496 mb, 24753 ins/s
20496 mb, 24196 ins/s
20492 mb, 24092 ins/s
20491 mb, 23656 ins/s
20496 mb, 23656 ins/s
20495 mb, 23213 ins/s
20496 mb, 22776 ins/s
20496 mb, 23280 ins/s
20495 mb, 22710 ins/s
20496 mb, 22352 ins/s
20492 mb, 22554 ins/s
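
For reference, size/throughput samples in the form above can be collected from a second mongo shell with a loop along these lines (a sketch only, not the script that produced these numbers; it assumes the collection is named c and that the serverStatus opcounters.insert counter advances once per inserted document):

function monitor() {
    // Sketch: print "<data size in MB>, <inserts/sec>" once per second.
    var prev = db.serverStatus().opcounters.insert
    while (true) {
        sleep(1000)                                        // shell sleep, in ms
        var cur = db.serverStatus().opcounters.insert
        var mb = Math.round(db.c.stats().size / (1024 * 1024))
        print(mb + " mb, " + (cur - prev) + " ins/s")
        prev = cur
    }
}
monitor()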

The effect seems to be worse with larger documents and with larger collections:

doc size    cap      initial rate    final rate    final/initial
200         2GB      194             160           82%
1000        10GB     140             84            60%
1000        20GB     140             63            45%
2000        20GB     115             50            43%
4000        20GB     83              23            27%

Stack trace excerpts showing involvement of WT delete (complete stack traces attached as capped-delete.html):



 Comments   
Comment by Ramon Fernandez Marina [ 17/Aug/15 ]

The capped collection insert rate decline over time has been fixed, so I'm resolving this issue.

I've created SERVER-19995 to investigate the performance drop of capped deletes when the collection is full.

Comment by Michael Cahill (Inactive) [ 12/Aug/15 ]

bruce.lucas@10gen.com, I am inclined to resolve the original issue reported here as fixed in 3.0.6 and 3.1.7. That is, ignoring the drop-off when the capped collection becomes full, the performance over time is now stable.

I'm happy to investigate the impact of the capped deletes further (though I don't have any good ideas at the moment), but I think that should have a separate ticket. Does that seem reasonable to you?

Comment by Githook User [ 04/Aug/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2091 from wiredtiger/intl-page-empty

SERVER-19522 Try to evict internal pages with no useful child pages.
(cherry picked from commit fb8739f423a824b76e1a26cbaa28c1edaa2f2f7d)
Branch: mongodb-3.0
https://github.com/wiredtiger/wiredtiger/commit/000ff74c23e153a3802cbf04523e72a08f00e3e3

Comment by Bruce Lucas (Inactive) [ 30/Jul/15 ]

My test run was standalone; everything else is as described above (12 core / 24 cpu machine, 64 GB mem, 24 threads of the workload). Any differences in throughput from my tests presumably are due to some difference in configuration or test parameters.

In particular my test was entirely CPU bound: the 20 GB capped collection fits in the default 32 GB cache. You can also see this from the stack traces I attached: when deletions begin, 24 threads spend much of their time in pthread_mutex_timedlock called from cappedDeleteAsNeeded, waiting for a single thread to do the deletions, and that single thread is clearly CPU-bound - spending almost all its time in __wt_tree_walk, and much of that time in __wt_evict, with almost no i/o among those stack traces. So it seems that if it were possible to eliminate the evictions, that would provide an improvement.

I also wonder if it might be possible to improve things by allowing more than one thread to do the deletions - for example, n threads each deleting every nth record of a range, to provide a better match for the n threads doing insertion. TBD I guess how much contention would limit the speedup.

Comment by Martin Bligh [ 30/Jul/15 ]

Only reason I can see to do it in WT is if you can think of some way to make lookups faster than having to do N lookups when we partition into N pieces.
Or if you keep child record cumulative counters / sum size?

Comment by Michael Cahill (Inactive) [ 30/Jul/15 ]

I don't see any way to avoid I/O if capped collections are larger than cache (as in the 20GB case here) because we need to read the tail in order to unindex records.

Yes, partitioning would fix this but would have to be done at a higher level than the current storage engine API, I think. FWIW, I believe that TokuMX has this kind of partitioning for exactly this reason.
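
For illustration only, the higher-level partitioning being discussed here might look roughly like this from the application side (a sketch with invented collection names, segment count, and sizes; nothing like this exists in MongoDB itself):

// Rotate writes across N smaller capped segments so that old data is
// reclaimed by dropping a whole segment rather than by deleting from the
// tail of one large capped collection.
var SEGMENTS = 4
var SEGMENT_BYTES = 5 * 1024 * 1024 * 1024        // 4 x 5 GB ~= one 20 GB cap
var current = 0
var bytesInCurrent = 0

function segName(n) { return 'c_part' + n }

function setup() {
    for (var i = 0; i < SEGMENTS; i++) {
        db[segName(i)].drop()
        db.createCollection(segName(i), {capped: true, size: SEGMENT_BYTES})
    }
}

function partitionedInsert(doc, approxBytes) {
    if (bytesInCurrent + approxBytes > SEGMENT_BYTES) {
        // Move to the next segment; recreating it discards the oldest data
        // in one cheap drop instead of many individual capped deletes.
        current = (current + 1) % SEGMENTS
        db[segName(current)].drop()
        db.createCollection(segName(current), {capped: true, size: SEGMENT_BYTES})
        bytesInCurrent = 0
    }
    db[segName(current)].insert(doc)
    bytesInCurrent += approxBytes
}

The cost is the one Martin raises below: a read now has to consult all N segments instead of one collection.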

Comment by Martin Bligh [ 30/Jul/15 ]

I'm guessing when we start capping the collection we're doing a bunch of read I/O and our working set is < WT cache size by the time we've done indexing etc?
If we're touching the disk, the dropoff seems expected; if we're not, it seems very odd?
I don't think we have a good way figured out to avoid reading the back end of the capped collection, except for the oplog special case.
If we partition the whole thing into segments, that'd fix it, but at the cost of much slower lookups?

Comment by Michael Cahill (Inactive) [ 30/Jul/15 ]

martin.bligh, Alex was running without oplog: I suspect the differences were down to the relative CPU and I/O speeds on his test machine vs Bruce's.

I haven't had a chance to look into this more, but I think this workload would go faster if the scan avoided evicting dirty pages, since we're just about to mark those pages deleted. I'll see what I can do, but at this point it will probably be early next week before I can really figure out what's possible.

Comment by Martin Bligh [ 30/Jul/15 ]

Are you guys running with or w/o oplog?

Comment by Alexander Gorrod [ 30/Jul/15 ]

bruce.lucas@10gen.com I've been running this workload and don't see as significant a drop-off as you do once capping starts. Having said that, my throughput appears lower than yours from the start (around 17k ops/sec).

The script I'm using to test the performance is based on the snippets you included above:

item_size = 4000

function create() {
    db.c.drop()
    db.createCollection('c', {capped: true, size: 20*1024*1024*1024})
}

function insert() {

    // Build an item_size-byte payload string.
    x = ''
    for (var i = 0; i < item_size; i++)
        x += 'x'

    last_time = Date.now();
    every = 10000
    for (var i = 0; ; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j = 0; j < every; j++, i++)
            bulk.insert({x: x});
        bulk.execute();
        // Report throughput every 100,000 inserts.
        if (i % 100000 == 0) {
            size = (i * item_size) / (1024 * 1024 * 1024)
            this_time = Date.now();
            elapsed_ms = this_time - last_time;
            ops_sec = 100000 / (elapsed_ms / 1000);
            print(i + " at time: " + elapsed_ms + " ms. ops/sec: " + ops_sec.toFixed(2) + " inserted: " + size.toFixed(2) + " GB")
            last_time = this_time;
        }
    }
}

create()
insert()

From looking at the profiling and the implementation of capped collection deletes, that's the sort of performance difference I would expect to see. I agree that there is work we could do to improve the performance of capped collections, but that feels as though it belongs in a different ticket to me.

Comment by Daniel Pasette (Inactive) [ 29/Jul/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

Message: SERVER-19522 Cleanup properly on error when reconciling internal pages.

We were leaving the WT_PAGE_SPLIT_LOCKED flag set.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/31058f483e0c34d9c141b83352481a18ae337eb8

Comment by Githook User [ 28/Jul/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2091 from wiredtiger/intl-page-empty

SERVER-19522 Try to evict internal pages with no useful child pages.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/fb8739f423a824b76e1a26cbaa28c1edaa2f2f7d

Comment by Githook User [ 28/Jul/15 ]

Author:

{u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'}

Message: WT-2024: acquire the WT_PAGE_SCANNING lock in all cases, __wt_update_serial
doesn't acquire the page lock when scanning for obsolete updates.

SERVER-19522: lift the WT_PAGE_SPLIT_LOCKED flag out of the __rec_row_int
and __rec_col_int functions, lock at the top-level when acquiring the rest
of our locks.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/e2cdf33370da23e5198e89e27b7d64892c5831c8

Comment by Githook User [ 28/Jul/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-19522 Try to evict internal pages with no useful child pages.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/3f231c090c623107a93b44cd1975e7fbf17f5448

Comment by Alexander Gorrod [ 28/Jul/15 ]

bruce.lucas@10gen.com I have reproduced the problem on 3.0. After some debugging it appears as though the problem is that WiredTiger ends up with a lot of internal pages with only references to deleted pages at the start of a tree. Walking those pages during the capped delete is slow, which causes the significant slowdown you see. We are working on a fix.

Comment by Bruce Lucas (Inactive) [ 24/Jul/15 ]

Verified after applying the patch for SERVER-19535 that this problem exists in 3.1.6 as well.

Comment by Michael Cahill (Inactive) [ 23/Jul/15 ]

bruce.lucas@mongodb.com, there is an attempt to clean up internal pages as well as leaf pages but it is possible that code isn't working well for this workload.

What the stacks suggest is that the cursor walk that figures out the next truncation point is spending a lot of time doing (pointless) eviction. That causes other threads to be throttled so that the oplog size is kept under control.

All of that code is eliminated by martin.bligh's work on "oplogStones" in SERVER-18875 and https://mongodbcr.appspot.com/6570003/ – can you retest with that patch, or after those changes are merged, to see whether this workload performs as expected (and keeps the oplog size bounded)?

Comment by Bruce Lucas (Inactive) [ 22/Jul/15 ]

The above stack traces and the numbers showing the declining rate were obtained with a 20 GB capped collection, 4 KB documents, and 24 threads, on a machine with 12 cores / 24 CPUs and 64 GB of memory. Code:

function create() {
    db.c.drop()
    db.createCollection('c', {capped:true, size:20*1024*1024*1024})
}

function insert() {

    // Build a 4 KB payload string.
    x = ''
    for (var i=0; i<4000; i++)
        x += 'x'

    // Insert forever in unordered bulk batches of 10,000 documents.
    every = 10000
    for (var i=0; ; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j=0; j<every; j++, i++)
            bulk.insert({x:x});
        bulk.execute();
        //print(i)
    }
}
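
The snippet above is single-threaded; the 24 threads were presumably 24 concurrent shell clients running insert(). As a self-contained alternative, the shell's built-in benchRun helper can drive parallel inserts, roughly as follows (a sketch only; the namespace, document shape, and duration are assumptions, not the reporter's actual harness):

// Sketch: drive 24 concurrent insert threads from a single shell.
var payload = new Array(4000 + 1).join('x')       // ~4 KB string
var res = benchRun({
    ops: [{ op: "insert", ns: "test.c", doc: { x: payload } }],
    parallel: 24,                                 // 24 client threads
    seconds: 60,
    host: db.getMongo().host
})
printjson(res)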
