[SERVER-22062] Foreground index build may hang 3.0.x WiredTiger node Created: 29/Dec/15  Updated: 22/May/16  Resolved: 18/May/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.8
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Liran Moysi Assignee: Ramon Fernandez Marina
Resolution: Won't Fix Votes: 1
Labels: RF, WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Mongo 3.0.8 + parallel


Attachments: PNG File 22062_pinned_high.png     PNG File 22062_pinned_low.png     HTML File another_stack_trace     PNG File fg_index_hang.png     PNG File fg_index_no_hang.png     File sslog.log.gz     Text File stacktrace.txt     PNG File timeseries.png    
Issue Links:
Duplicate
is duplicated by SERVER-21616 WiredTiger hangs when mongorestoring ... Closed
Related
is related to SERVER-18844 Reacquire the snapshot after commit/a... Closed
Operating System: ALL
Sprint: Integrate+Tuning 15 (06/03/16)
Participants:

 Description   

We are using MongoDB 3.0.8 with max_threads = 4 and a patch to mongod that parallelizes the cloning process; the hang occurs when replicating a large data set (it didn't happen with a smaller one). In our setup we used 17 threads cloning different databases, each thread holding a db lock (instead of a global lock).

Examining the cache flags with gdb shows that the STUCK bit is set; here's the
cache structure:

{
  bytes_inmem = 60122154727,
  pages_inmem = 5668935,
  bytes_internal = 243665765,
  bytes_overflow = 0,
  bytes_evict = 486975238126,
  pages_evict = 5648345,
  bytes_dirty = 59789008836,
  pages_dirty = 6448,
  bytes_read = 1482481072,
  evict_max_page_size = 31232046,
  read_gen = 1682697,
  read_gen_oldest = 1682790,
  evict_cond = 0x39abcd0,
  evict_lock = {
    lock = {
      __data = {
        __lock = 0,
        __count = 0,
        __owner = 0,
        __nusers = 0,
        __kind = 0,
        __spins = 0,
        __elision = 0,
        __list = {
          __prev = 0x0,
          __next = 0x0
        }
      },
      __size = '\000' <repeats 39 times>,
      __align = 0
    },
    counter = 0,
    name = 0x17446c2 "cache eviction",
    id = 0 '\000',
    initialized = 1 '\001'
  },
  evict_walk_lock = {
    lock = {
      __data = {
        __lock = 0,
        __count = 0,
        __owner = 0,
        __nusers = 0,
        __kind = 0,
        __spins = 0,
        __elision = 0,
        __list = {
          __prev = 0x0,
          __next = 0x0
        }
      },
      __size = '\000' <repeats 39 times>,
      __align = 0
    },
    counter = 0,
    name = 0x17446d1 "cache walk",
    id = 0 '\000',
    initialized = 1 '\001'
  },
  evict_waiter_cond = 0x39abd40,
  eviction_trigger = 95,
  eviction_target = 80,
  eviction_dirty_target = 80,
  overhead_pct = 8,
  evict = 0x4bd4000,
  evict_current = 0x0,
  evict_candidates = 100,
  evict_entries = 100,
  evict_max = 400,
  evict_slots = 400,
  evict_file_next = 0x570f9c700,
  sync_request = 0,
  sync_complete = 0,
  cp_saved_read = 0,
  cp_current_read = 0,
  cp_skip_count = 0,
  cp_reserved = 0,
  cp_session = 0x0,
  cp_tid = 0,
  flags = 40
}

A stack trace is attached. As you can see, all cloning threads (threads 51 through 67) are hung on the eviction condition "0x39abd40", which they reach via the __wt_cache_full_check() call.
Thread #8 is also stuck on the same call, via a _deleteExcessDocuments call.
The eviction server (Thread #2) is sleeping, and this happens constantly.
The eviction workers seem to have no work: there are 3 live eviction workers (threads 68 through 70), all of which are waiting on the same condition.

This situation reproduces itself over and over at some point during the initial clone; any idea as to why this happens would be great.
The small patch for the parallelization is available here:
https://github.com/liranms/mongo/commit/a216bb0d8159f8030b5d666ffa8869c57f28fcc0
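
To correlate the gdb snapshot above (bytes_inmem and bytes_dirty against the 95% eviction_trigger) with live server state, here is a minimal sampling loop; the statistic names are assumed to be the standard ones exposed under serverStatus().wiredTiger.cache, and the last samples before a hang should show the cache pinned near the eviction trigger:

mongo --eval "
    while (true) {
        var c = db.serverStatus().wiredTiger.cache;
        var used  = c['bytes currently in the cache'];      // cf. bytes_inmem above
        var max   = c['maximum bytes configured'];
        var dirty = c['tracked dirty bytes in the cache'];  // cf. bytes_dirty above
        print(Date(),
              (100 * used / max).toFixed(1) + '% full,',
              (100 * dirty / max).toFixed(1) + '% dirty');
        sleep(1000);
    }
"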



 Comments   
Comment by Ramon Fernandez Marina [ 17/May/16 ]

Roy, Liran,

to the best of our knowledge, this bug has surfaced in two use cases:

  • A modified mongod that initial syncs collections in parallel (this ticket)
  • Multiple mongorestore processes running in parallel (SERVER-21616)

In both cases, the root cause is long-running transactions caused by large, parallel, foreground index builds.

After a thorough investigation of the technical aspects of this issue, here's a summary of our discussion:

  • Unfortunately the fix requires considerable changes in the query layer
  • As you know, users expect more stability with each minor release, but the changes required to fix this ticket have the potential to significantly destabilize the 3.0 branch
  • MongoDB 3.2, which does not exhibit this behavior, is being successfully used in production by many users, and I encourage you to upgrade if this issue is critical for you

You're right that upgrading major versions requires a larger testing effort. However, we believe the risks of fixing this issue in v3.0 outweigh the benefits, so we've decided to not invest any more resources in this ticket. Apologies for the inconvenience.

Regards,
Ramón

Comment by Roy Reznik [ 17/May/16 ]

As you probably understand, upgrading between major versions requires heavy testing in order to ensure that the new version is compatible and reliable, so it's not an option.
We were promised 3 months ago, when a repro was finally made, that this ticket would be taken care of; since then, across 3 minor releases, it still hasn't been addressed.
Do you have an ETA?

Comment by Ramon Fernandez Marina [ 09/May/16 ]

Hi royrez@microsoft.com, apologies for the delay. As I mentioned above, the fix for 3.0 proved more complex than anticipated; in addition, the fact that 3.2 does not exhibit this issue hasn't played in favor of this ticket when competing for resources.

We'd still like to fix this issue in 3.0, but if this is critical for you I'd recommend you evaluate upgrading to 3.2 at your earliest convenience.

Regards,
Ramón.

Comment by Roy Reznik [ 08/May/16 ]

Hello Ramon,
Another month+ has passed, and still no progress on the ticket.
Do you have an ETA? We're getting near half a year!

Comment by Ramon Fernandez Marina [ 01/Apr/16 ]

Apologies for the radio silence royrez@microsoft.com. As per my last message the fix for this issue in the v3.0 branch is not straightforward, so it will require some time to implement.

Note that 3.2 does a much better job in this situation and so far our reproducers have not been able to trigger the problem there.

Anyway, we'll post any further updates about the v3.0 branch on this ticket as they happen.

Regards,
Ramón.

Comment by Roy Reznik [ 31/Mar/16 ]

?

Comment by Roy Reznik [ 14/Mar/16 ]

Any news?
It's a deadlock ticket and it's been 3 months already...

Comment by Ramon Fernandez Marina [ 24/Feb/16 ]

The fix for this ticket needs further investigation, so it won't make it into 3.0.10 – pushing it to a later release.

Comment by Alexander Gorrod [ 04/Feb/16 ]

I've done some more digging into the difference between MongoDB 3.0 and 3.2, and I'm pretty sure I have isolated the change that improved the behavior. In SERVER-17062 there was a change that, among other things, altered the yield policy for foreground index builds.

I ran tests against MongoDB 3.2: one unpatched build and one build with the foreground index part of SERVER-17062 backed out (see below for details). With that change backed out I see similar behavior from MongoDB 3.2 to what I see with MongoDB 3.0 - that is, the foreground index build does not yield, and thus pins a lot of content in the WiredTiger cache. The following pictures illustrate the difference:

With unpatched MongoDB 3.2, I see the following transaction range of IDs pinned and throughput characteristics:

With patched MongoDB 3.2, I see the following transaction range of IDs pinned and throughput characteristics:

The code change I used for testing is:

--- a/src/mongo/db/catalog/index_create.cpp
+++ b/src/mongo/db/catalog/index_create.cpp
@@ -232,8 +232,10 @@ Status MultiIndexBlock::insertAllDocumentsInCollection(std::set<RecordId>* dupsO
     if (_buildInBackground) {
         invariant(_allowInterruption);
         exec->setYieldPolicy(PlanExecutor::YIELD_AUTO);
+#if 0
     } else {
         exec->setYieldPolicy(PlanExecutor::WRITE_CONFLICT_RETRY_ONLY);
+#endif
     }
 
     Snapshotted<BSONObj> objToIndex;

Comment by Eric Milkie [ 02/Feb/16 ]

To answer the question about yield policies: the difference in policies between foreground and background index builds is integral to their identities; basically, a background index build is an index build that yields (at all), while a foreground index build doesn't yield and locks the database for its entire duration. Locking the database allows the foreground index build to reliably gather all the index keys so that it can sort them before outputting the index tree.
A better solution for building indexes would be a hybrid of foreground and background, where the keys are gathered in a collection scan and sorted before building the tree, but the collection scan can also yield. The build would need to collect all changes to the collection while it was scanning the collection, and then apply all the changes in a catch-up phase at the end. I believe there is a work ticket in the server to make this type of index build.
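
To make that difference observable from the shell, here is a minimal sketch; it assumes the 3.0-era currentOp output, where index builds report a progress message containing "Index Build" and a numYields counter. A foreground build should keep numYields at 0 for its whole duration, while a background build's count keeps climbing:

mongo --eval "
    // list in-progress index builds and how many times each has yielded
    db.currentOp().inprog.forEach(function(op) {
        if (op.msg && /Index Build/.test(op.msg))
            print(op.ns, '|', op.msg, '| numYields:', op.numYields);
    });
"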

Comment by Alexander Gorrod [ 01/Feb/16 ]

bruce.lucas I can reproduce a hang using MongoDB 3.0.9 and the reproducer script you outlined. I can't reproduce that failure with a MongoDB 3.2.1 package. The key difference I see between 3.0.9 and 3.2.1 when tracking serverStatus metrics is the "transaction range of IDs currently pinned" statistic. For example, when a run hangs I see:

When it doesn't hang I see:

Notice that the timelines are very different. When the hang happens, serverStatus stops being gathered, so we only see a small window into the index build. In the case where the index build doesn't get hung, the number of transaction IDs pinned peaks at 30 thousand; in the case where it hangs, the number of IDs pinned grows steadily to over 2.5 million. Pinning such a large range of IDs means that WiredTiger can't find any pages to evict from cache, since all content is needed to satisfy old readers. MongoDB has a concept of yielding operations, which avoids creating long-running transactions and thus avoids exactly this problem.
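
For reference, here is a minimal loop for watching that statistic while the reproducer runs; the field path is an assumption based on the statistic name above (serverStatus().wiredTiger.transaction):

mongo --eval "
    while (true) {
        var t = db.serverStatus().wiredTiger.transaction;
        // healthy runs peak around 30 thousand; hung runs climb steadily
        // into the millions as the foreground index build never yields
        print(Date(), 'IDs pinned:', t['transaction range of IDs currently pinned']);
        sleep(1000);
    }
"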

In the case of foreground index builds, MongoDB uses a different yielding policy from most other operations. See https://github.com/mongodb/mongo/blob/master/src/mongo/db/catalog/index_create.cpp#L232-L237
redbeard0531, geert.bosch: could you explain why the yield policy is different for foreground and background index builds?

Comment by Bruce Lucas (Inactive) [ 29/Jan/16 ]

I've been able to reproduce what looks to me like the same hang on vanilla 3.0.8 by doing inserts and an index build in parallel.

  • ssd, 12 core / 24 cpu
  • standalone mongod, 3.0.8, 5 GB cache, no journal (just to make it faster, probably doesn't matter)
  • create a collection with largish documents, then in parallel insert into another collection and index the first collection (the two collections are in different dbs so the index build doesn't block the writing threads)

threads=24
 
function insert {
    db=$1    # database
    c=$2     # collection
    count=$3 # number of docs
    size=$4  # doc size
    for t in $(seq $threads); do
        mongo $db --eval "
            s = ''
            for (var i=0; i<$size; i++)
                s += 's'
            c = db['$c']
            every = 10000
            for (var i=0; i<$count; ) {
                var bulk = c.initializeUnorderedBulkOp()
                for (var j=0; j<every; j++, i++)
                    bulk.insert({x:i, s:s})
                bulk.execute();
                print(i)
            }
        " &
    done
    wait
}
 
insert test1 c    500000 10000                           # create coll test1.c with biggish docs
insert test2 d 100000000  1000 &                         # do writes on test2.d
mongo test1 --eval "printjson(db.c.createIndex({x:1}))"  # and in parallel do index build on test1.c

Shortly after the index build starts, it gets stuck waiting on a full cache:

Thread 26 (Thread 0x7f2e05561700 (LWP 16881)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00000000013ac01e in __wt_cond_wait ()
#2  0x0000000001390fe4 in __wt_cache_wait ()
#3  0x0000000001345770 in __wt_page_in_func ()
#4  0x000000000135a9fe in ?? ()
#5  0x000000000135b402 in __wt_tree_walk ()
#6  0x000000000132ceea in __wt_btcur_next ()
#7  0x000000000137a16c in ?? ()
#8  0x0000000000d9299b in mongo::WiredTigerRecordStore::Iterator::_getNext() ()
#9  0x0000000000d92a53 in mongo::WiredTigerRecordStore::Iterator::getNext() ()
#10 0x00000000009eea84 in mongo::CollectionScan::work(unsigned long*) ()
#11 0x0000000000bd8852 in mongo::PlanExecutor::getNextSnapshotted(mongo::Snapshotted<mongo::BSONObj>*, mongo::RecordId*) ()
#12 0x0000000000915aa0 in mongo::MultiIndexBlock::insertAllDocumentsInCollection(std::set<mongo::RecordId, std::less<mongo::RecordId>, std::allocator<mongo::RecordId> >*) ()
#13 0x000000000093f995 in mongo::CmdCreateIndex::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#14 0x00000000009c6ef1 in mongo::_execCommand(mongo::OperationContext*, mongo::Command*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#15 0x00000000009c7f0c in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, bool) ()
#16 0x00000000009c8bdb in mongo::_runCommands(mongo::OperationContext*, char const*, mongo::BSONObj&, mongo::_BufBuilder<mongo::TrivialAllocator>&, mongo::BSONObjBuilder&, bool, int) ()
#17 0x0000000000ba57cb in mongo::runQuery(mongo::OperationContext*, mongo::Message&, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::CurOp&, mongo::Message&) ()
#18 0x0000000000ab2572 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#19 0x00000000007ea895 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
#20 0x0000000000f3eb89 in mongo::PortMessageServer::handleIncomingMsg(void*) ()
#21 0x00007f2e0b12e182 in start_thread (arg=0x7f2e05561700) at pthread_create.c:312
#22 0x00007f2e0a22f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

So are the insert threads:

Thread 25 (Thread 0x7f2e04854700 (LWP 16978)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00000000013ac01e in __wt_cond_wait ()
#2  0x0000000001390fe4 in __wt_cache_wait ()
#3  0x00000000013d9169 in ?? ()
#4  0x0000000000d98f42 in mongo::WiredTigerRecoveryUnit::_txnOpen(mongo::OperationContext*) ()
#5  0x0000000000d99063 in mongo::WiredTigerRecoveryUnit::getSession(mongo::OperationContext*) ()
#6  0x0000000000d990aa in mongo::WiredTigerCursor::WiredTigerCursor(std::string const&, unsigned long, bool, mongo::OperationContext*) ()
#7  0x0000000000d964ff in mongo::WiredTigerRecordStore::insertRecord(mongo::OperationContext*, char const*, int, bool) ()
#8  0x00000000008f9b21 in mongo::Collection::_insertDocument(mongo::OperationContext*, mongo::BSONObj const&, bool) ()
#9  0x00000000008fa66c in mongo::Collection::insertDocument(mongo::OperationContext*, mongo::BSONObj const&, bool) ()
#10 0x000000000099d9b1 in mongo::WriteBatchExecutor::execOneInsert(mongo::WriteBatchExecutor::ExecInsertsState*, mongo::WriteErrorDetail**) ()
#11 0x000000000099e76d in mongo::WriteBatchExecutor::execInserts(mongo::BatchedCommandRequest const&, mongo::WriteConcernOptions const&, std::vector<mongo::WriteErrorDetail*, std::allocator<mongo::WriteErrorDetail*> >*) ()
#12 0x00000000009a06e9 in mongo::WriteBatchExecutor::bulkExecute(mongo::BatchedCommandRequest const&, mongo::WriteConcernOptions const&, std::vector<mongo::BatchedUpsertDetail*, std::allocator<mongo::BatchedUpsertDetail*> >*, std::vector<mongo::WriteErrorDetail*, std::allocator<mongo::WriteErrorDetail*> >*) ()
#13 0x00000000009a0f15 in mongo::WriteBatchExecutor::executeBatch(mongo::BatchedCommandRequest const&, mongo::BatchedCommandResponse*) ()
#14 0x00000000009a3cfc in mongo::WriteCmd::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#15 0x00000000009c6ef1 in mongo::_execCommand(mongo::OperationContext*, mongo::Command*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#16 0x00000000009c7f0c in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, bool) ()
#17 0x00000000009c8bdb in mongo::_runCommands(mongo::OperationContext*, char const*, mongo::BSONObj&, mongo::_BufBuilder<mongo::TrivialAllocator>&, mongo::BSONObjBuilder&, bool, int) ()
#18 0x0000000000ba57cb in mongo::runQuery(mongo::OperationContext*, mongo::Message&, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::CurOp&, mongo::Message&) ()
#19 0x0000000000ab2572 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#20 0x00000000007ea895 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
#21 0x0000000000f3eb89 in mongo::PortMessageServer::handleIncomingMsg(void*) ()
#22 0x00007f2e0b12e182 in start_thread (arg=0x7f2e04854700) at pthread_create.c:312
#23 0x00007f2e0a22f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment by Roy Reznik [ 13/Jan/16 ]

Any news on this?
Seems like a pretty serious bug and doesn't seem like it's getting addressed.

Comment by Liran Moysi [ 06/Jan/16 ]

Michael, thanks for the details. You suggested that the long-running transaction is causing the hang; the server has been running for an additional 10 hours after the last timeseries snapshot, so it sounds like a full hang (perhaps a deadlock).
I should also clarify that the hang happens during the index build stage (the db copy stage of the initial sync completes successfully).
This did not happen during single-threaded sync; however, single-threaded sync does not generate the same load as the multi-threaded sync, so this elusive bug/deadlock may only manifest itself in this situation...
The reason we are using the multi-sync patch is that the single-threaded sync is not fast enough, and the newly synced node becomes stale after the index build phase completes.
I should also mention that the multi-sync patch works successfully for both mmapv1 and rocksdb initial sync, so this could be a WiredTiger-specific issue. Can you think of any code that assumes a global lock during the index build phase?

Comment by Michael Cahill (Inactive) [ 06/Jan/16 ]

This hang is caused by a long-running transaction.

In particular, from the timeseries data, it looks like there was a previous long-running transaction that completed by doing some deletes, but the final transaction has been running for almost 10 minutes without completing:

It is difficult to say without reproducing this whether it is caused by the patch, but to the best of my knowledge, we haven't seen this in single-threaded initial sync.

liranms, can you confirm whether this occurs on an unpatched mongod?

Comment by Liran Moysi [ 30/Dec/15 ]

@Keith.Bostic - it's a full hang.
@Alexander.Gorrod -
I restarted the sync process from zero and ran the command you suggested. As you can see, it works fine until a certain point, about 1.5 hours into the sync, when it stops responding. Sending the db.serverStatus(..) command manually just hangs the mongo client.
I uploaded the file for you to check out, plus another stack trace for this situation (which is quite similar to the first one).

Comment by Alexander Gorrod [ 29/Dec/15 ]

The reason all content would be stuck in cache is most likely a long-running transaction. If you can gather periodic serverStatus statistics for the period while the workload is running, we'll be able to get better visibility.

Could you run the following:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &

Then upload the resulting log file.

Comment by Keith Bostic (Inactive) [ 29/Dec/15 ]

liranms, a quick clarification: is it a full hang, no more progress ever, or do things eventually get unstuck and proceed, and that's what you meant by the situation reproducing itself?

If it's the latter, do things get unstuck quickly, or does it take a long time? And then, how long until things get stuck again?
