[SERVER-21652] Stalls during checkpoints Created: 24/Nov/15  Updated: 30/Mar/16  Resolved: 30/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.2.0-rc5

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File bigoplog.png     File diagnostic.data.tar     PNG File gdb.png     PNG File gdb2.png     PNG File incident1-checkpoint.png     PNG File incident1-evict-server.png     PNG File incident1-inserts.png     PNG File incident1-overview.png     PNG File incident1.png     PNG File stalls.png     PNG File standalone.png    
Issue Links:
Depends
depends on WT-2244 Trigger in-memory splits sooner Closed
is depended on by SERVER-21686 WiredTiger changes for 3.2.0-rc5 Closed
Related
related to SERVER-21585 Cache usage exceeds configured value,... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
  • single-node replica set
  • 2 threads inserting 1 kB compressible documents
  • 10 GB oplog, 10 GB cache
  • SSD for data files, separate HDD for journal
  • using this build to avoid issues with the LAS file from SERVER-21585

  • once the cache fills, inserts stall during the latter part of checkpoints
  • ops with latencies up to 18779 ms are reported in the mongod log
  • ftdc also stalls

A similar test on 3.0.7 showed some slowdown during checkpoints but no complete stalls, and the maximum latency seen was 7 s. (The 3.0.7 test used 15 threads instead of 2 to get comparable throughput.)

function repro(threads) {

    Random.setRandomSeed();

    // Secondary index so each insert also does index maintenance.
    db.c.ensureIndex({y: 1});

    // Build a batch of 1000 documents, each with a highly compressible ~1 kB string.
    var docs = [];
    for (var i = 0; i < 1000; i++) {
        var x = '';
        while (x.length < 1000)
            x += 'x';
        docs.push({x: x, y: i, z: 0});
    }

    var insert_ops = [{
        op: 'insert',
        ns: 'test.c',
        doc: docs,
        safe: true,
    }];

    // Run the insert workload on the requested number of parallel threads.
    var res = benchRun({
        ops: insert_ops,
        seconds: 10000,
        parallel: threads
    });
    printjson(res);

}



 Comments   
Comment by Michael Cahill (Inactive) [ 30/Nov/15 ]

This should be fixed by WT-2244, in the latest WT drop.

Comment by Michael Cahill (Inactive) [ 25/Nov/15 ]

* why is the checkpoint so busy in __rec_txn_read - something about the state of the tree? something about the number or state of the transactions?

That is where we decide which versions of updates get written, and it has become more complicated since we added the "lookaside table" to WiredTiger. The job of the lookaside table is to allow WiredTiger to make progress when more updates have to be kept active to satisfy snapshot reads than fit in cache.

All of that said, we should dig into this to understand better why so much time is being spent in there. We may be able to optimize it for common cases where the lookaside table is not active to get back to 3.0 behavior.
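To make that decision a bit more concrete, here is a minimal illustrative sketch in C. It is not WiredTiger source; all names, types, and the visibility rule are invented for illustration. It shows the kind of per-key choice described above: walk the update chain newest-first, keep the newest version visible to the checkpoint's snapshot, and under cache pressure mark the newer versions that must stay readable for concurrent snapshot readers to be spilled to the lookaside table instead of staying pinned in cache.

/* Illustrative only: hypothetical types and names, not WiredTiger source. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t txn_id_t;

struct update {
    txn_id_t txnid;       /* ID of the transaction that created this version */
    struct update *next;  /* older version (chain is newest-first) */
};

/* Hypothetical visibility check against the checkpoint's snapshot. */
static bool visible(txn_id_t txnid, txn_id_t snap_max) {
    return txnid < snap_max;
}

/*
 * Pick the version of a key to write into the checkpoint. Versions newer
 * than the snapshot cannot go into the checkpoint but must stay readable
 * for concurrent snapshot readers; when the cache is full they are marked
 * for the lookaside table so the page can still be evicted.
 */
static struct update *select_version(struct update *chain, txn_id_t snap_max,
                                     bool cache_full, bool *to_lookaside) {
    struct update *upd = chain;

    *to_lookaside = false;
    while (upd != NULL && !visible(upd->txnid, snap_max))
        upd = upd->next;

    if (upd != chain && cache_full)
        *to_lookaside = true;   /* retained newer versions spill out of cache */
    return upd;
}

int main(void) {
    /* Chain: txn 30 (newest) -> txn 10 (oldest); checkpoint snapshot at 20. */
    struct update old = {10, NULL};
    struct update newer = {30, &old};
    bool to_lookaside;
    struct update *chosen = select_version(&newer, 20, true, &to_lookaside);

    printf("write txn %llu, lookaside=%d\n",
           (unsigned long long)chosen->txnid, to_lookaside);
    return 0;
}

The extra bookkeeping for the lookaside path is one plausible reason the checkpoint spends more time here than it did in 3.0, which is what we want to verify.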

* what are the inserts waiting for in __session_begin_transaction?

We wait in there if the cache is full. The reason for that is that after the transaction starts, it will pin updates, so we don't want to block a running transaction if we can avoid it. Blocking before the transaction starts avoids pathological cases where a full cache can make transactions run for longer, pinning more in cache.
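As a rough illustration of that ordering (throttle first, take the snapshot only afterwards), here is a minimal hypothetical C sketch. It is not WiredTiger's actual code; the names and stub behaviour are invented.

/* Illustrative only: hypothetical names, not WiredTiger source. */
#include <stdint.h>
#include <stdio.h>

struct cache {
    uint64_t bytes_inuse;
    uint64_t bytes_max;
};

/* Stub: the application thread helps eviction make progress. */
static void evict_one_page(struct cache *cache) {
    if (cache->bytes_inuse >= 4096)
        cache->bytes_inuse -= 4096;
}

/* Stub: allocate the snapshot; from here on, update versions are pinned. */
static void assign_snapshot(void) {
    printf("snapshot assigned\n");
}

static void begin_transaction(struct cache *cache) {
    /*
     * Throttle *before* the snapshot exists. This is where the insert
     * threads in the stack traces are parked: a full cache blocks new
     * transactions rather than letting them start and pin even more.
     */
    while (cache->bytes_inuse > cache->bytes_max)
        evict_one_page(cache);

    assign_snapshot();
}

int main(void) {
    struct cache cache = {.bytes_inuse = 11 << 20, .bytes_max = 10 << 20};
    begin_transaction(&cache);
    return 0;
}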

* why is the range of pinned transactions growing to almost 1M during the stuck period, yet inserts are stalled and little to no transactions are beginning? Is this expected behavior?

I will have to look a little more closely to be sure, but we allocate transaction IDs internally in some cases, such as during splits and when writing to the lookaside table. I suspect one or both of those is contributing to this growth in transactions pinned.
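One way to picture that growth (purely illustrative arithmetic, with invented numbers): the pinned range reported in the graphs is roughly the current allocated ID minus the oldest ID still required, and internal ID allocation keeps pushing the current ID forward even while application transactions are blocked.

/* Illustrative only: invented numbers, just the arithmetic behind the metric. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t oldest_id  = 5000000;   /* held steady by the running checkpoint */
    uint64_t current_id = oldest_id;

    /* Internal work (splits, lookaside writes) still allocates IDs. */
    for (int i = 0; i < 900000; i++)
        current_id++;

    /* The "range of pinned transactions" is the gap between the two. */
    printf("pinned range: %llu\n", (unsigned long long)(current_id - oldest_id));
    return 0;
}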

* do all of the above point to some issue related to transactions?

Yes – this is an issue with eviction in the face of long-running transactions such as checkpoints. There were several changes made during the 3.0 cycle to avoid stalls like these and some of that code changed in 3.2 for various reasons. It looks like we have inadvertently lost at least one of the fixes.

Comment by Bruce Lucas (Inactive) [ 25/Nov/15 ]

I had a little trouble getting stack traces: at some point the process would just get very stuck, and I'm not sure why. Without more context it's hard for me to tell whether those stack traces were taken during a typical stall or during an atypical period caused by gdb interfering with the process. I'm seeing fairly different stack traces, so I'm not sure how what I'm looking at below relates to what you were looking at, e.g. how the server in general was behaving during the samples you took.

Here's a run I did on rc4 with 8 threads, sampling with gdbmon at 5-second intervals.

We can see that things got really wedged at N but looked normal before then, so I think the data through N is good. Zooming in on the range J-M, when things looked OK and there was a typical-looking stall from K-L:

The following stack-trace timelines cover the same period J-M, with the stall from K-L around the middle; it can be identified by focusing on the same portion of the timelines below that K-L occupies in the graph above:

The insert threads are all stuck waiting for something in __session_begin_transaction during that middle stuck period:

The eviction server also seems to be waiting for something:

The checkpoint thread seems to be executing (not waiting or doing i/o, based on the stack traces) in __rec_txn_read the whole duration of the stall:

Questions:

  • why is the checkpoint so busy in __rec_txn_read - something about the state of the tree? something about the number or state of the transactions?
  • what are the inserts waiting for in __session_begin_transaction?
  • why is the range of pinned transactions growing to almost 1M during the stuck period, yet inserts are stalled and little to no transactions are beginning? Is this expected behavior?
  • do all of the above point to some issue related to transactions?
Comment by Martin Bligh [ 25/Nov/15 ]

Cleaner profile

Comment by Martin Bligh [ 24/Nov/15 ]

Looks like mostly __wt_txn_id_alloc (presumably today's WT commit) and our old friend tcmalloc?
See gdb.png - I don't have Bruce's ninja skills for attaching pictures into a comment.

Comment by Bruce Lucas (Inactive) [ 24/Nov/15 ]

Standalone run doubling the load (4 threads instead of 2) to keep the total workload roughly the same as in the replica-set case: throughput stalls and long-latency ops (14 s) still occur once the cache fills, but there are no ftdc stalls (so I suspect those were mediated by the oplog in some way):

Comment by Bruce Lucas (Inactive) [ 24/Nov/15 ]

Increased the oplog size to 100 GB to test the effect of oplog deletions. The problem developed when the cache filled, before the oplog filled and truncates began at A, and did not seem to get particularly worse after the oplog filled, so it doesn't seem to be related to oplog truncates.
