[SERVER-20876] Hang in scenario with sharded ttl collection under WiredTiger Created: 12/Oct/15  Updated: 10/Dec/15  Resolved: 12/Nov/15

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.6, 3.0.7
Fix Version/s: 3.0.8

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.tar     HTML File gdbmon.html     Text File gdbmon.log     HTML File repro-04-gdbmon-s0r0.html     Text File repro-04-gdbmon-s0r0.log     File repro-06-diagnostic.data.tar     HTML File repro-06-gdbmon.html     PNG File repro-06-gdbmon.png     PNG File repro-06-ts.png     PNG File repro-12.png     File repro-14-diagnostic.data.tar     HTML File repro-14-gdbmon.html     PNG File repro-14-gdbmon.png     PNG File repro-14.png     PNG File repro-15.png     File standalone-diagnostic.data.tar     PNG File standalone.png     PNG File stuck.png    
Issue Links:
Related
related to SERVER-20877 Under cache-full conditions serverSta... Closed
is related to SERVER-20996 74% performance regression in db.c.re... Closed
is related to SERVER-21027 Reduced performance of index lookups ... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Setup:

  • 3.0.6
  • two shards, each standalone mongod
  • shard key is _id (not hashed), using standard default ObjectIDs.
  • WT cache size 10 GB
  • single ttl collection set to expire documents after 10 minutes
  • 4 threads inserting 1 kB documents into the ttl collection

After about 25 minutes of run time (about 15 minutes after ttl deletions started) one of the shards became completely stuck. Stack traces captured after it became stuck are attached:

  • TTLMonitor, splitVector, serverStatus are all stuck waiting for full cache:
    • TTLMonitor is stuck in __wt_page_in_func while traversing the collection
    • serverStatus and splitVector are both stuck in WT somewhere within mongo::WiredTigerRecoveryUnit::_txnOpen - why would a transaction open be stuck waiting for a full cache?
  • no sign of any eviction activity (or of the eviction worker threads, but maybe they're hidden in the "??" call sites?).

Here's an overview of the run for the shard that became stuck:

  • inserts are fast but unbalanced because only one shard at a time is active in this particular test (due to simple shard key).
  • insert rate drops very dramatically when ttl deletions begin at A. This performance hit seems surprisingly large.
  • ttl deletion passes run A-B, C-D, E-F, G-. Each begins with a bump in "ttl passes" and ends with a bump in "ttl deletedDocuments".
  • WT range of IDs pinned correlates well with the deletion passes. However I don't believe there are intentional long-running transactions, but rather I suspect one or more deletions just run very slowly.
  • it appears that at H the cache becomes full and everything gets stuck, presumably because there are no evictions happening as per the stack traces above which were captured after H.
  • interestingly though evictions appear to have been happening as normal right up to H - so why would they stop?
  • there is no data after H because serverStatus has become stuck as well.
  • the only clear connection to any sharding-related activity that I can find is the stuck splitVector call; however it has been running frequently and seems to be stuck in the same way serverStatus is, so it seems unlikely it is a culprit.


 Comments   
Comment by Githook User [ 13/Nov/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-20876 Backport seekExact to 3.0 (lint)
Branch: v3.0
https://github.com/mongodb/mongo/commit/eb905b18f7987dc2e67bc532895013f66ca723e8

Comment by Githook User [ 12/Nov/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-20876 Backport seekExact to 3.0
Branch: v3.0
https://github.com/mongodb/mongo/commit/2a168d1195afebdfe9357c2f0be9144186514a16

Comment by Bruce Lucas (Inactive) [ 20/Oct/15 ]

michael.cahill, agreed that changing how _id lookups are done to avoid the slow search_near fixes the particular scenario that this ticket reports. I opened SERVER-21027 to continue the investigation of the underlying issue with search_near. Ideally we would get an understanding of the cause of that issue to have a better idea of the scope of impact.

Comment by Kaloian Manassiev [ 19/Oct/15 ]

In response to bruce.lucas's question to renctan above: I just checked the code, and the only operations in the sharding receive-chunk path which explicitly create WUOWs are the initial index and collection creation (in both cases with no data), so they should be very quick operations.

The call stack above, which includes sharding migration lines, is from the receive side, which does one transaction per document.

Comment by Bruce Lucas (Inactive) [ 19/Oct/15 ]

I believe Michael was referring to the "transaction range of IDs pinned" statistic, for example in this graph from above:

Note how it climbs linearly while the TTL remove calls are executing, for the duration of the execution. If I understand correctly this is a sign that some transaction is open that requires some transactions to be pinned, presumably the write transactions for the TTL in this case (because that's all that's happening in that graph at that rate). The question is what transaction is open that requires the write transactions to be pinned: Is the TTL index query by any chance done as a single transaction that is open as we do the deletes? Is that transaction closed when we yield? In some cases in fact I have seen the range of ids pinned reset to 0 during the middle of a TTL pass, giving a "double sawtooth" for that pass (and I think you can just barely see that happening immediately after A in that graph). Would this indicate that the TTL pass yielded at that point? Is the issue maybe that we aren't yielding enough?

(To be clear, whatever issues that might cause, I believe that is a separate issue and is not responsible for the problems reported on this ticket, because a) those problems occur long after the TTL activity has ended and no transactions are pinned, and b) the problems seen here occur whether the documents are deleted by TTL or by ordinary application removes).
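
For reference, a minimal way to watch the statistic under discussion from the mongo shell is a loop like the sketch below; it assumes the field name exposed under serverStatus().wiredTiger.transaction in 3.0, which may differ in other versions. (Note that once a shard reaches the stuck state described in this ticket, serverStatus itself hangs, so this only helps while things are still moving.)

// sketch: poll the pinned-transaction-range statistic once a second
// (field name as exposed by serverStatus() in 3.0; adjust if it differs in your version)
while (true) {
    var txn = db.serverStatus().wiredTiger.transaction;
    print(new Date().toISOString(), "range of IDs pinned:",
          txn["transaction range of IDs currently pinned"]);
    sleep(1000);
}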

Comment by Geert Bosch [ 19/Oct/15 ]

michael.cahill, there should never be many deletes happening in a single transaction. Each WriteUnitOfWork should commit separately and thus cause a new transaction to be started. Separately, we do yielding every so often. Can you elaborate on why you think 3M objects are deleted in a single txn?

Comment by Michael Cahill (Inactive) [ 19/Oct/15 ]

bruce.lucas, I agree that there is more work to do, but given that sharding does not do lookups in indexes other than _id, the patch build should address the most common cases where this issue has been seen.

Comment by Bruce Lucas (Inactive) [ 19/Oct/15 ]

Don't we also need to look into the underlying reason the search_near becomes slow after deleting records from the collection, whether with TTL or with ordinary removes?

The patch fix indeed avoids that issue for _id lookups, but not for more general index operations. A slight tweak to the previous repro to use a (non-unique) secondary index shows the same problem with lookups done via that index; run using the patch build:

A typical stack during the slower index lookups after the removes shows the same underlying issue with slow search_near, due to tree walks, yet the table is empty at this point:

Thread 56 (Thread 0x7f03c627a700 (LWP 32334)):
#0  0x0000000001318a34 in __wt_tree_walk ()
#1  0x00000000012ec9f6 in __wt_btcur_next ()
#2  0x00000000012f6a5a in __wt_btcur_search_near ()
#3  0x00000000013373ba in ?? ()
#4  0x0000000000d69b42 in ?? ()
#5  0x0000000000d6ce23 in ?? ()
#6  0x0000000000a8313d in mongo::BtreeIndexCursor::seek(mongo::BSONObj const&, bool) ()
#7  0x00000000009f6949 in mongo::IndexScan::initIndexScan() ()
#8  0x00000000009f6d1d in mongo::IndexScan::work(unsigned long*) ()
#9  0x00000000009ea53d in mongo::FetchStage::work(unsigned long*) ()
#10 0x00000000009f86a4 in mongo::LimitStage::work(unsigned long*) ()
#11 0x0000000000bc6414 in mongo::PlanExecutor::getNextSnapshotted(mongo::Snapshotted<mongo::BSONObj>*, mongo::RecordId*) ()
#12 0x0000000000bc67c4 in mongo::PlanExecutor::getNext(mongo::BSONObj*, mongo::RecordId*) ()
#13 0x0000000000b94424 in mongo::runQuery(mongo::OperationContext*, mongo::Message&, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::CurOp&, mongo::Message&) ()
#14 0x0000000000aa4120 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#15 0x00000000009cbf70 in mongo::DBDirectClient::call(mongo::Message&, mongo::Message&, bool, std::string*) ()
#16 0x0000000000854420 in mongo::DBClientCursor::init() ()
#17 0x0000000000840f58 in mongo::DBClientBase::query(std::string const&, mongo::Query, int, int, mongo::BSONObj const*, int, int) ()
#18 0x00000000009cbcb7 in mongo::DBDirectClient::query(std::string const&, mongo::Query, int, int, mongo::BSONObj const*, int, int) ()
#19 0x0000000000ee148d in mongo::mongoFind(mongo::V8Scope*, v8::Arguments const&) ()
#20 0x0000000000ec93a8 in mongo::V8Scope::v8Callback(v8::Arguments const&) ()

Repro follows; no TTL, no sharding, nothing pathological - just remove stuff from a collection and then do lookups via a secondary index.

db=/ssd/db
 
function start {
    killall -9 -w mongod
    rm -rf $db
    mkdir -p $db/r0
    mongod --storageEngine wiredTiger --dbpath $db/r0 --logpath $db/mongod.log --fork
    sleep 5
}
 
function create {
    mongo --eval "db.createCollection('c')"
    mongo --eval "db.c.createIndex({x:1})"
}
 
function insert {
    threads=24
    for t in $(seq $threads); do
        mongo --eval "
            count = 2000000
            every = 10000
            for (var i=0; i<count; ) {
                var bulk = db.c.initializeUnorderedBulkOp();
                for (var j=0; j<every; j++, i++)
                    bulk.insert({x:i})
                bulk.execute();
                print(i)
            }
        " &
    done
    wait
}
 
function remove {
    mongo --eval "db.c.remove({})"
}
 
function find() {
    mongo --eval "
        n = 500000
        t = new Date()
        db.eval(function(n) {
            q = {x: 0}
            for (var i=0; i<n; i++)
                db.c.findOne(q)
        }, n)
        t = new Date() - t
        print(Math.floor(1000*n/t), 'q/s')
    "
}
 
start
create; find; find; find           # newly created empty collection
insert; remove; find; find; find   # fill, empty, then measure find

Comment by Michael Cahill (Inactive) [ 19/Oct/15 ]

Running bruce.lucas's repro, I see an ~85% slowdown between the initial "find" steps and the steps after the insert/remove.

With the patch build mentioned above, there is no slowdown. I'll push that change into CR.

It is still important to understand which operation in MongoDB is keeping a transaction pinned across millions of TTL remove operations, because that is likely to cause other problems as well, but the change from search_near to search for _id lookups does seem to address the symptom identified here.

Comment by Bruce Lucas (Inactive) [ 18/Oct/15 ]

Above was with 3.0.7. I tried it on 3.2.0-rc0, but ran into SERVER-20996, which caused the remove() in this test to take a very long time and I could see the rate of document removal continuing to decline as it ran, so I didn't wait for the run to complete.

Comment by Bruce Lucas (Inactive) [ 18/Oct/15 ]

Simple standalone repro:

  • create an empty collection, measure the performance of findOne() using the _id index
  • fill the collection with some number of documents, remove them all using db.c.remove({}), then measure the performance of findOne() again. It is now much slower, and gdb shows it's spending all its time in __wt_tree_walk,__wt_btcur_prev,__wt_btcur_search_near.

  • A-B: repeated findOne using _id index on initial empty collection
  • B-C: filling collection
  • C-D: emptying collection
  • D-F: repeated findOne using _id index on filled/emptied collection. Note much lower query rate.
  • D-E: it did appear to evict pages with deleted items at the start, but apparently something has remained that requires an expensive tree walk.

Stack samples covering the same time period as above, showing that the difference between the first and the second set of findOnes is the tree walk:

Here's the repro script:

db=/ssd/db
 
function start {
    killall -9 -w mongod
    rm -rf $db
    mkdir -p $db/r0
    mongod --storageEngine wiredTiger --dbpath $db/r0 --logpath $db/mongod.log --fork
    sleep 5
}
 
function create {
    mongo --eval "db.createCollection('c')"
}
 
function insert {
    threads=24
    for t in $(seq $threads); do
        mongo --eval "
            count = 2000000
            every = 10000
            for (var i=0; i<count; ) {
                var bulk = db.c.initializeUnorderedBulkOp();
                for (var j=0; j<every; j++, i++)
                    bulk.insert({})
                bulk.execute();
                print(i)
            }
        " &
    done
    wait
}
 
function remove {
    mongo --eval "db.c.remove({})"
}
 
function find() {
    mongo --eval "
        n = 500000
        t = new Date()
        db.eval(function(n) {
            q = {_id: ObjectId()}
            for (var i=0; i<n; i++)
                db.c.findOne(q)
        }, n)
        t = new Date() - t
        print(Math.floor(1000*n/t), 'q/s')
    "
}
 
start
create; find; find; find           # newly created empty collection
insert; remove; find; find; find   # fill, empty, then measure find

Comment by Bruce Lucas (Inactive) [ 16/Oct/15 ]

How much memory should one expect an _id index table that is full of nothing but tombstone markers to use? After I do the findOne() to get rid of the tombstone markers in the collection table, and verify that the slow lookups in the _id index table are still in progress, the WT "bytes currently in the cache" statistic drops from about 1.6 GB to essentially 0 (1.9 MB to be exact). The documents in the collection were tiny. Is it plausible that the tombstone markers in the _id index table would take up so much less space (1.9 MB) than the tombstone markers in the collection table (1.6 GB)?
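
For reference, one way to take that measurement from the mongo shell is to read the connection-level cache statistic before and after the findOne(), along the lines of the sketch below (stat name as exposed under serverStatus().wiredTiger.cache in 3.0):

function cacheBytes() {
    // connection-level WT cache stat; name as exposed by serverStatus() in 3.0
    return db.serverStatus().wiredTiger.cache["bytes currently in the cache"];
}
var before = cacheBytes();
db.c.findOne();   // traverses the collection table; per the observation above, cache bytes drop afterwards
var after = cacheBytes();
print("cache bytes before:", before, "after:", after);

This only gives the connection-level total; db.c.stats({indexDetails: true}) exposes per-index WiredTiger statistics as well, though exactly which cache counters appear there depends on the WiredTiger version.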

Comment by Bruce Lucas (Inactive) [ 16/Oct/15 ]

Agreed that while changing to use search instead of search_near for _id lookups sounds like a good thing and would ease the symptoms, it may leave other uses of the index vulnerable to the underlying problem, and we should address that as well.

Also while the TTL behavior also may not be optimal, I don't think it's the underlying problem either since the problem (low performance of _id lookups) persists after the tables have been completely emptied and TTL activity has ceased.

How can we 1) determine definitively whether the problem is unevictable tombstones in the _id index, and 2) whether it is due to a long-running transaction, and if so 3) who is opening the transaction?

(Regarding the experiment I reported on above: I believe the behavior when I did the findOne was a red herring. I repeated the experiment doing the findOne while the chunk move was stuck instead of waiting until after it became unstuck, and it showed the same behavior: first time was slow, second time was fast, and the cache bytes fell. I think this is because findOne does not use the _id index table but only the collection table, so the findOne only cleared out the tombstones in the collection table, which had not been touched by the running recvChunk because it was consulting only the index table. Conversely the findOne had no effect on the running recvChunk transaction, nor did any other query, e.g. a query that traversed the _id index. So the net is I now think that experiment didn't tell us anything about the cause of the performance issue with _id lookups, other than perhaps to show that whatever is preventing tombstones from being removed from the _id index table is not causing a similar problem with the collection table).

Comment by Michael Cahill (Inactive) [ 16/Oct/15 ]

I tried a slightly different approach to making this path do a search in 3.0. There is a patch build here:

https://evergreen.mongodb.com/build/mongodb_mongo_v3.0_linux_64_33b39af9a7660624928f9fc2ed647a14becc1f30_15_10_16_06_11_16

I haven't yet had a chance to run bruce.lucas's reproducer against this build – I'm hopeful that it may ease the symptom, but I hope we can also get a better understanding of the underlying problem.

Comment by Alexander Gorrod [ 16/Oct/15 ]

We should be able to change _id lookups (and in fact any unique index) to call WT_CURSOR::search rather than search_near. That will be faster in general and avoid this next/prev behavior in this case if the collection is full of tombstones.

I took a run at making that change. It's possible to make it work, but there is code that expects a failed call to _locate to leave the cursor positioned. I coded around about 4 places that have expectations about search semantics, but there were more. My feeling is that switching from a search_near to a search for all queries on a unique index would be possible but it is an invasive change.

It's simple to try a search in the unique code and fall back to a search_near if the key isn't found. The cost there is that if a search is done for a non-existent entry we pay the cost of an additional search. The storage_wiredtiger_index_test program uncovers lots of such cases. Thinking more about this alternative, I don't think it will help in the user's case - they are doing searches for keys that don't exist, otherwise we wouldn't see next/prev calls in the profiler traces.

Comment by Michael Cahill (Inactive) [ 15/Oct/15 ]

bruce.lucas, one thing I didn't address:

if a large number of tombstones can't be deleted because of an open transaction, why doesn't that show up as a large range of pinned ids for the duration of the time that the transaction is open?

The tombstones should become candidates for eviction as soon as the pinned range goes to zero. They should be evicted either by cache pressure (if the cache is full), or the next time a cursor moves over them, assuming there isn't some other thread blocking the eviction.

We do build in some delays to avoid repeatedly retrying eviction if it fails, but I would not expect that to block eviction in this case. In summary, I would also expect to be able to evict tombstones by "B" at the latest.

Comment by Michael Cahill (Inactive) [ 15/Oct/15 ]

bruce.lucas, the pinned ID statistic is related to running transactions, not data in cache. WiredTiger maintains an "oldest ID", which is updated regularly to track the oldest transaction in the system. The statistic indicates how many update transactions have started since.

The reason why I had assumed that TTL was responsible was the shape of the pinned transaction graph – it drops as soon as the TTL pass completes (at "C" in your first graph).

I could of course be reading it wrong and the migration could have a transaction active and be blocked behind TTL.

Comment by Michael Cahill (Inactive) [ 15/Oct/15 ]

geert.bosch, I know you're looking at the 3.2 TTL code in SERVER-19334 so I was hoping you can help here.

What I'm seeing from the WiredTiger side in 3.0 is that a transaction is running for the whole of a TTL pass. In bruce.lucas's repro, that means ~3 million remove operations have to be buffered in cache.

It looks like the deleteObjects call should yield and split the transaction into pieces (because YIELD_AUTO is set), and there is a separate WriteUnitOfWork around each operation in DeleteStage::work. Do you have any ideas about why a transaction could be running for the whole time?

Comment by Bruce Lucas (Inactive) [ 15/Oct/15 ]

I wonder if the issue is not a long-running ttl transaction, but rather a long-running transaction related to the long-running recvChunk on the receiving shard. I tried the following experiment:

  • at A the chunk migration to this shard begins, and gets stuck. This stops inserts on this shard because the hot chunk always remains on the other shard.
  • ttl deletions on this shard continue for about 10 minutes until B, at which point the collection on this shard has been emptied and all that remains in the tables is tombstones.
  • the chunk migration to this shard remains stuck, I believe because the tombstones that can't be deleted are slowing it down sufficiently that it will never catch up with the operations (both inserts and deletes) proceeding on the other shard on the chunk that is in migration to this shard.
  • at C I stopped the load, and also stopped the balancer to prevent future chunk migrations (to keep things simpler)
  • at D the chunk migration to this shard finally completes (as seen in the _recvChunkStatus stat): by stopping the load, I have given this shard a chance to catch up with the operations that had been happening to that chunk on the other shard while it was in migration.
  • then at E I issued a findOne() on the collection; as expected it returned no results, but took a couple of seconds because of all the tombstones. The second invocation was fast, because the traversal of the tombstones by the first invocation had cleared them all out. Correspondingly we see that the bytes in cache stat falls from 8 GB to 0 almost instantly, because we have now freed the tombstones.

I didn't try doing a findOne during the chunk migration (between C and D) to confirm that it didn't clear out the tombstones, but the stack traces above and the nature of the operations I believe indicate that it would have cleared the tombstones if it could. So my suspicion is that it might be the chunk migration itself that is keeping open a transaction, and it is that very transaction that is preventing the deletion of the tombstones that are slowing it down, so that as long as operations proceed on the donor shard it will never complete.

renctan, the long-running operation in question here is the catch-up phase of receiving a chunk. A typical stack follows. Do any of the functions on this stack hold open a WT transaction for the duration?

Thread 119 (Thread 0x7f747259e700 (LWP 17641)):
#0  0x0000000001318cdb in __wt_tree_walk ()
#1  0x00000000012f0000 in __wt_btcur_prev ()
#2  0x00000000012f670f in __wt_btcur_search_near ()
#3  0x000000000133798a in ?? ()
#4  0x0000000000d69912 in ?? ()
#5  0x0000000000d69b65 in ?? ()
#6  0x0000000000d6bdc8 in ?? ()
#7  0x0000000000a73267 in mongo::BtreeBasedAccessMethod::findSingle(mongo::OperationContext*, mongo::BSONObj const&) const ()
#8  0x00000000009cd2cb in mongo::Helpers::findById(mongo::OperationContext*, mongo::Database*, char const*, mongo::BSONObj, mongo::BSONObj&, bool*, bool*) ()
#9  0x0000000000de97e8 in mongo::MigrateStatus::apply(mongo::OperationContext*, std::string const&, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj const&, unsigned long long*) ()
#10 0x0000000000dfa14f in mongo::MigrateStatus::_go(mongo::OperationContext*, std::string const&, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, std::string const&, mongo::OID const&, mongo::WriteConcernOptions const&) ()
#11 0x0000000000dfbb3b in mongo::MigrateStatus::go(mongo::OperationContext*, std::string const&, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, std::string const&, mongo::OID const&, mongo::WriteConcernOptions const&) ()
#12 0x0000000000de603f in mongo::migrateThread(std::string, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, std::string, mongo::OID, mongo::WriteConcernOptions) ()
#13 0x0000000000de742d in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string, mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, std::string, mongo::OID, mongo::WriteConcernOptions), boost::_bi::list7<boost::_bi::value<std::string>, boost::_bi::value<mongo::BSONObj>, boost::_bi::value<mongo::BSONObj>, boost::_bi::value<mongo::BSONObj>, boost::_bi::value<std::string>, boost::_bi::value<mongo::OID>, boost::_bi::value<mongo::WriteConcernOptions> > > >::run() ()
#14 0x0000000000fac5f4 in ?? ()
#15 0x00007f747a876182 in start_thread (arg=0x7f747259e700) at pthread_create.c:312
#16 0x00007f747997747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

michael.cahill, one question: if a large number of tombstones can't be deleted because of an open transaction, why doesn't that show up as a large range of pinned ids for the duration of the time that the transaction is open, which would be presumably A-D?

Comment by Michael Cahill (Inactive) [ 15/Oct/15 ]

In the problematic stack, findSingle for the _id lookup is calling WiredTiger's search_near operation. That operation attempts to find a matching key "close" to the search key. If it lands on a deleted key, it first tries to step next then if that doesn't find a visible record, it does a prev operation to step backwards.

Unfortunately, the API and implementation mean that if there is a huge table and all the records are deleted, each search_near will traverse the whole table before finally returning WT_NOTFOUND. While not a hang, exactly, that's going to be very slow.

The reason those deleted records can't be evicted (so the tombstones remain in the system, and the query has to keep skipping over them) is because there is an old transaction ID keeping updates pinned in cache. It looks like it must be the TTL monitor that is running in one big transaction.

Possible solutions:

  1. Can we yield during the TTL deletes (or otherwise split the operation into bounded pieces)? Doing ~3M deletes in a single WiredTiger transaction is possible but problematic.
  2. We should be able to change _id lookups (and in fact any unique index) to call WT_CURSOR::search rather than search_near. That will be faster in general and avoid this next/prev behavior in this case if the collection is full of tombstones.

Update: I note that MongoDB 3.1+ does search on unique indexes rather than search_near. That may contribute to this issue not reproducing on recent development releases.

Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ]

Repro script below. This was run on a machine with 12 cores / 24 cpus, 64 GB memory. You may need to adjust cache and/or number of threads for a smaller machine. However note that it seems to be load-dependent: with threads set to 1 it did not reproduce.

db=/ssd/db # configure this
 
function start {
    killall -9 -w mongo mongos mongod
    rm -rf $db
    mkdir -p $db/{s0r0,s1r0,c0,ms0}
    mongod --dbpath $db/s0r0 --logpath $db/s0r0.log --port 30000 --storageEngine wiredTiger --nojournal --wiredTigerCacheSizeGB 10 --fork
    mongod --dbpath $db/s1r0 --logpath $db/s1r0.log --port 31000 --storageEngine wiredTiger --nojournal --wiredTigerCacheSizeGB 10 --fork
    mongod --configsvr --dbpath $db/c0 --logpath $db/c0.log --port 27019  --fork
    sleep 2
    mongos --logpath $db/ms0.log --port 27017 --configdb localhost:27019 --fork
    sleep 2
}
 
function setup {
    mongo admin --eval "
        printjson(sh.addShard('localhost:30000'))
        printjson(sh.addShard('localhost:31000'))
    "
    mongo --eval "
        printjson(sh.enableSharding('test'))
        printjson(sh.shardCollection('test.c', {_id:1}))
        printjson(db.c.createIndex({t:1}, {expireAfterSeconds: 10*60}))
    "
}
 
function run {
 
    threads=4
 
    for t in $(seq $threads); do
 
        mongo --eval "
    
            x = ''
            for (var i=0; i<1000; i++)
                x += 'x'
        
            every = 100
            for (var i=0; ; ) {
                if (i%10000==0)
                    print(i)
                doc = {t: ISODate(), x:x}
                var bulk = db.c.initializeUnorderedBulkOp();
                for (var j=0; j<every; j++, i++)
                    bulk.insert(doc)
                bulk.execute()
            }
        " &
 
    done
 
    wait
}
 
start; setup; run

Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ]

Version survey:

  • 3.0.6 - problems as described above
  • 3.0.7 - one run, appears same (stuck migration)
  • 3.1.9 - sharding behavior was substantially different: all data went initially to shard 0, and was then migrated to shard 1, so it appears there was no "hot chunk" migration. The problem did not occur in about 2.5 hours of runtime, although performance did decline somewhat gradually, primarily, it appeared, because ttl passes were taking longer. Unknown whether the behavior is better because an underlying problem (such as skipping over deleted items) was improved, or whether the more serial pattern of inserts and deletes due to the different migration pattern simply avoided the issue; more investigation needed.
  • 3.1.9, shard key {_id: 'hashed'}: initially good, but after about 50 minutes performance plummeted suddenly by more than an order of magnitude. Unknown whether that is the same or a separate issue; more investigation needed.

Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ]

The repro seems to be reliable, with one of two outcomes:

  1. a chunk move gets stuck, apparently indefinitely, blocking all future balancing operations.
  2. under some conditions tbd, after the chunk move gets stuck, the cache can become full with apparently nothing that can be evicted, at which point essentially everything on that shard is stuck.

Detailed analysis of an incident of the first type:

A migration to shard 0 got stuck apparently indefinitely - from currentOp

    "desc" : "migrateThread",
    "secs_running" : 59789,
    "msg" : "step 3 of 5",

Message means it has completed step 3, which is the bulk clone, and now it's in the catch-up phase where it is applying ops that have happened while it was cloning.
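
For reference, the stuck operation can be isolated from the full currentOp output with a small filter run directly against the shard, along the lines of the sketch below (field names as in the excerpt above):

// list in-progress migrateThread operations on this shard,
// using the fields shown in the currentOp excerpt above
db.currentOp().inprog.filter(function (op) {
    return op.desc == "migrateThread";
}).forEach(function (op) {
    print(op.opid, op.secs_running, op.msg);
});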

Here are the last messages in mongod log, which are the ones related to the stuck migration:

2015-10-12T16:34:27.451-0400 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: ObjectId('561c1953f01c5c19650907c2') } -> { _id: MaxKey } for collection test.c from localhost:31000 at epoch 561c166d7656c939213178e7
2015-10-12T16:34:27.452-0400 I SHARDING [migrateThread] Deleter starting delete for: test.c from { _id: ObjectId('561c1953f01c5c19650907c2') } -> { _id: MaxKey }, with opId: 31196867
2015-10-12T16:34:27.452-0400 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.c from { _id: ObjectId('561c1953f01c5c19650907c2') } -> { _id: MaxKey }

(The deleting is about removing anything on that shard that may happen to already be there in that chunk range in preparation for moving the chunk there; as expected nothing was deleted.)

Here's the range being migrated from shard 1 to shard 0:

{ _id: ObjectId('561c1953f01c5c19650907c2') } -> { _id: MaxKey }
 
> ObjectId('561c1953f01c5c19650907c2').getTimestamp()
ISODate("2015-10-12T20:34:27Z")

Note that the timestamp for the _id marking the beginning of the range is about the same time the chunk move began, so the chunk being migrated is the current "hot" chunk i.e. the chunk receiving all the current ObjectIDs, presumably being migrated proactively from shard 1 to shard 0 so it doesn't have to be migrated later when it has gotten big.

Here are some stats relating to that migration:

  • The stuck migration starts at A, marked by the _recvChunkStart command, and it remains in progress thereafter, marked by the steady stream of _recvChunkStatus commands which relate to the chunk move.
  • No further application inserts are received on this shard after A because all migrations are stuck, so the hot chunk receiving updates can't move off of the other shard.
  • Starting a little before the stuck migration we see some ttl deletes, marked by ttl passes, cursor removes, and pinned transactions. The ttl activity continues for about 10 minutes (the ttl timeout period), at which point we have deleted everything that is on this shard and we are not receiving any further data. (Confirmed that collection size on this shard is 0.)

Taken together the above are a signature of this issue.

Here are some selected stack traces showing activity at each call site for that same timeline, showing the same landmarks A, B, C:

  • The chunk move is stuck in MigrateStatus::apply (at 2 above) processing deletions, presumably the deletions that have occurred on the other shard since it did the bulk clone of the chunk being moved.
  • Most of that time is spent in findSingle looking up an object by _id, from two different places, one directly within apply (at 5, to see if the _id exists) and one within deleteObjects (in 3, to delete it).

I wonder if the issue is that the ttl deletions on this shard have made those _id lookups slow due to an accumulation of deleted items, so the to-shard is not able to process the catch-up ops as quickly as they are being generated on the from-shard, and it never catches up. Note that the collection has by now been entirely emptied by ttl deletions (as mentioned above, collection count on the stuck shard is confirmed to be 0), so it appears all we are doing at this point is skipping over deleted items; so why are we unable to get rid of the deleted items?

Comment by Bruce Lucas (Inactive) [ 12/Oct/15 ]

Another repro attempt with gdbmon running the whole time did not hit the issue. However it does have some information about why the insert rate drops so dramatically after ttl deletions start: we are spending a lot of time in splitVector (for reasons tbd), and mongos calls splitVector sequentially with inserts, so insert performance drops. You'll note that some of this time is in wt, but there is also substantial time (if not more) being spent in the mongod layer in things like toBson and woCompare. It is not clear why that is the case, nor whether this is really related to the hang issue. Files attached as repro-04-*.

Regarding the apparent interaction between the balancer and ttl deletions: suppose we move an old chunk to a shard at the same time new data is being inserted. The end result I think would have the old and new records interspersed (possibly even interleaved) in the wt table, and then when we delete the old records we will be deleting every other record (or something similar). Perhaps this is a case that is problematic for WT?

Comment by Bruce Lucas (Inactive) [ 12/Oct/15 ]

Here's a run of the same ttl workload standalone (complete stats as standalone-diagnostic.data.tar):

  • insert performance remains high through the ttl deletion passes at A-B, C-D, E-F, G-H, I-.
  • delete performance remains consistently high, whereas in the sharded case delete performance became spikey and overall declined as each ttl pass progressed (compare "cursor remove calls" stat).
  • range of IDs pinned remains small throughout.

One difference between the sharded and non-sharded situations is that in the sharded case the documents in the collection will have been moved (assuming the balancer is on) from one shard to another; perhaps those moves leave the collection in a state such that a) deletes from the collection decline in performance throughout a ttl pass and also dramatically reduce the performance of the inserts, and b) it is prone to the hang. For example, the chunk moves may have disordered the documents in the ttl index and/or collection.
