[SERVER-22209] Collection creation during final phase of checkpoint holds database lock for extended time Created: 16/Jan/16  Updated: 07/Dec/16  Resolved: 04/Apr/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.2.5, 3.3.4

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 2
Labels: WTplaybook, code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File create-stall.png    
Issue Links:
Depends
depends on WT-2346 Don't hold schema lock during checkpo... Closed
Duplicate
is duplicated by SERVER-22642 WiredTiger engine resync stalls with ... Closed
Related
related to WT-2342 Enhance wtperf to support background ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:

 Description   

Here's an example:

  • At B we see an operation acquiring the db lock (acquireCount)
  • That operation apparently blocks, because we see many threads queuing up behind it (timeAcquiringMicros), resulting in a complete stall in this case (single db)
  • This occurred during the final phase of a checkpoint (range of IDs pinned), and it ends when the checkpoint ends
  • In the same app (not shown above) we see other collection creates during a checkpoint, but not during its final phase, that do not cause a similar stall, so this appears to be specific to the final phase of a checkpoint (a sketch of the contention pattern follows this list)
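
The contention pattern can be illustrated directly against the WiredTiger C API, which is what mongod is sitting on in the stack traces below. This is only a minimal sketch, not the wtperf reproducer from WT-2342: the WT_HOME directory, the demo_N table names, the cache size, and the loop counts are all made up for illustration, and on an otherwise idle database the checkpoints may finish too quickly for a stall to be visible (a write-heavy workload is what makes the final phase long in practice).

// Hedged sketch (assumptions: WT_HOME already exists as an empty directory;
// table names, config strings, and loop counts are illustrative only).
#include <string>
#include <thread>
#include <wiredtiger.h>

int main() {
    WT_CONNECTION *conn;
    if (wiredtiger_open("WT_HOME", nullptr, "create,cache_size=1GB", &conn) != 0)
        return 1;

    // WT_SESSION handles are not thread safe, so each thread opens its own session.
    std::thread checkpointer([conn] {
        WT_SESSION *s;
        conn->open_session(conn, nullptr, nullptr, &s);
        for (int i = 0; i < 10; ++i)
            s->checkpoint(s, nullptr);   // on affected versions the final phase held the schema lock (WT-2346)
        s->close(s, nullptr);
    });

    std::thread creator([conn] {
        WT_SESSION *s;
        conn->open_session(conn, nullptr, nullptr, &s);
        for (int i = 0; i < 100; ++i) {
            std::string uri = "table:demo_" + std::to_string(i);
            // session->create() needs the schema lock, so it can stall behind an
            // in-progress checkpoint; under mongod this is what an implicit
            // collection create hits while it already holds the database lock.
            s->create(s, uri.c_str(), "key_format=S,value_format=S");
        }
        s->close(s, nullptr);
    });

    checkpointer.join();
    creator.join();
    return conn->close(conn, nullptr);
}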

Stack traces collected during the stall show that the blocked operation is a collection create (this was not evident from the stats and required stack traces, because it was an implicit create and we don't have a counter for that), and that it is blocked in __wt_session_create:

00000063`18cac488 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
00000063`18cac490 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
00000063`18cac560 00007ff7`ef7e827d ntdll!RtlEnterCriticalSection+0xe4
00000063`18cac5a0 00007ff7`ef7e4bbc mongod!__wt_session_create+0x8d
00000063`18cac5d0 00007ff7`ef18c88b mongod!__session_create+0x19c
00000063`18cac630 00007ff7`ef11dae1 mongod!mongo::WiredTigerKVEngine::createRecordStore+0x26b
00000063`18cac860 00007ff7`eed76f75 mongod!mongo::KVDatabaseCatalogEntry::createCollection+0x321
00000063`18cacb00 00007ff7`eee03b13 mongod!mongo::Database::createCollection+0x4b5
00000063`18cacdb0 00007ff7`eee06a1f mongod!mongo::WriteBatchExecutor::ExecInsertsState::_lockAndCheckImpl+0x563
00000063`18cad130 00007ff7`eee06714 mongod!mongo::insertOne+0x14f
00000063`18cad2c0 00007ff7`eee05316 mongod!mongo::WriteBatchExecutor::insertMany+0x164
00000063`18cad390 00007ff7`eee04654 mongod!mongo::WriteBatchExecutor::execInserts+0x1b6
00000063`18cad510 00007ff7`eee05bba mongod!mongo::WriteBatchExecutor::bulkExecute+0xd4
00000063`18cad610 00007ff7`eee0b6d1 mongod!mongo::WriteBatchExecutor::executeBatch+0x3ca
00000063`18cad960 00007ff7`eee2394e mongod!mongo::WriteCmd::run+0x1a1
00000063`18cadb90 00007ff7`eee1ee0e mongod!mongo::Command::run+0x74e
00000063`18cadf70 00007ff7`eed9a717 mongod!mongo::Command::execCommand+0x8ae
00000063`18cae3d0 00007ff7`eeeeb897 mongod!mongo::runCommands+0x297
00000063`18cae6a0 00007ff7`eeee776c mongod!mongo::receivedCommand+0x207
00000063`18caea60 00007ff7`ef81221e mongod!mongo::assembleResponse+0x90c
00000063`18caf160 00007ff7`ef2e57f7 mongod!mongo::MyMessageHandler::process+0xce
00000063`18caf380 00007ff7`ef016bdd mongod!mongo::PortMessageServer::handleIncomingMsg+0x4a7

Stack traces also confirm multiple threads queued up behind db lock:

00000063`18bacde8 00007ffc`48b21118 ntdll!ZwWaitForSingleObject+0xa
00000063`18bacdf0 00007ffc`2748744f KERNELBASE!WaitForSingleObjectEx+0x98
00000063`18bace90 00007ffc`2748515b msvcr120!CRT_RTC_INITW+0x3af
00000063`18baced0 00007ffc`2cfac8c6 msvcr120!Concurrency::details::_Condition_variable::wait_for+0x1ab
00000063`18bacf90 00007ff7`eee1673f msvcp120!Call_onceEx+0xe2
00000063`18bad000 00007ff7`eee1504b mongod!mongo::CondVarLockGrantNotification::wait+0xef
00000063`18bad0a0 00007ff7`eee144e7 mongod!mongo::LockerImpl<0>::lockComplete+0x8b
00000063`18bad250 00007ff7`eee0c725 mongod!mongo::LockerImpl<0>::lock+0x57
00000063`18bad290 00007ff7`eee02cec mongod!mongo::Lock::DBLock::DBLock+0x155
00000063`18bad300 00007ff7`eee036ea mongod!std::make_unique<mongo::Lock::DBLock,mongo::Locker * __ptr64,mongo::StringData,enum mongo::LockMode>+0x5c
00000063`18bad360 00007ff7`eee06a1f mongod!mongo::WriteBatchExecutor::ExecInsertsState::_lockAndCheckImpl+0x13a
00000063`18bad6e0 00007ff7`eee06714 mongod!mongo::insertOne+0x14f
00000063`18bad870 00007ff7`eee05316 mongod!mongo::WriteBatchExecutor::insertMany+0x164
00000063`18bad940 00007ff7`eee04654 mongod!mongo::WriteBatchExecutor::execInserts+0x1b6
00000063`18badac0 00007ff7`eee05bba mongod!mongo::WriteBatchExecutor::bulkExecute+0xd4
00000063`18badbc0 00007ff7`eee0b6d1 mongod!mongo::WriteBatchExecutor::executeBatch+0x3ca
00000063`18badf10 00007ff7`eee2394e mongod!mongo::WriteCmd::run+0x1a1
00000063`18bae140 00007ff7`eee1ee0e mongod!mongo::Command::run+0x74e
00000063`18bae520 00007ff7`eed9a717 mongod!mongo::Command::execCommand+0x8ae
00000063`18bae980 00007ff7`eeeeb897 mongod!mongo::runCommands+0x297
00000063`18baec50 00007ff7`eeee776c mongod!mongo::receivedCommand+0x207
00000063`18baf010 00007ff7`ef81221e mongod!mongo::assembleResponse+0x90c
00000063`18baf710 00007ff7`ef2e57f7 mongod!mongo::MyMessageHandler::process+0xce
00000063`18baf930 00007ff7`ef016bdd mongod!mongo::PortMessageServer::handleIncomingMsg+0x4a7
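
At the mongod level the chain is: the implicit create takes the per-database lock, then blocks inside __wt_session_create waiting for WiredTiger's schema lock, which the checkpoint holds during its final phase; every other writer on that database then queues behind the db lock. Below is a hedged, stand-alone C++ model of that ordering. It is not mongod code: the schema_lock/db_lock names, the sleeps, and the use of plain exclusive mutexes are simplifications whose only purpose is to show why one stalled create blocks otherwise unrelated inserts.

// Hedged model, not mongod source: schema_lock / db_lock stand in for
// WiredTiger's schema lock and MongoDB's per-database lock, with intent
// modes simplified to plain exclusive mutexes.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex schema_lock;  // held by the checkpoint during its final phase (pre-fix)
std::mutex db_lock;      // MongoDB per-database lock

int main() {
    using namespace std::chrono_literals;

    // Checkpoint: holds the schema lock for a long "final phase".
    std::thread checkpoint([] {
        std::lock_guard<std::mutex> hold(schema_lock);
        std::this_thread::sleep_for(2s);  // stands in for writing out the checkpoint
    });
    std::this_thread::sleep_for(100ms);   // let the checkpoint take the schema lock first

    // Implicit collection create: takes the db lock, then blocks (as in the
    // __wt_session_create frame above) waiting for the schema lock, while still
    // holding the db lock.
    std::thread create([] {
        std::lock_guard<std::mutex> db(db_lock);
        std::lock_guard<std::mutex> schema(schema_lock);  // waits out the checkpoint
        std::cout << "create finished only after the checkpoint completed\n";
    });
    std::this_thread::sleep_for(100ms);

    // Ordinary insert on the same database: queues behind the db lock, which is
    // the timeAcquiringMicros spike seen in the diagnostics.
    std::thread insert([] {
        std::lock_guard<std::mutex> db(db_lock);
        std::cout << "insert ran only after the create released the db lock\n";
    });

    checkpoint.join();
    create.join();
    insert.join();
    return 0;
}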



 Comments   
Comment by Andrew de Quincey [ 23/Apr/16 ]

You're right. Thanks for looking into it though!

Comment by Ramon Fernandez Marina [ 22/Apr/16 ]

adq@tvsquared.com, a backport to the 3.0 series is not feasible, but that should not affect you: I'd recommend you upgrade to 3.0+MMAPv1, and move to 3.2+WiredTiger from there. That way you avoid the configuration affected by this issue (3.0 + WiredTiger).

Comment by Ramon Fernandez Marina [ 26/Jan/16 ]

adq, we need to wait and see how the fix looks on the development branch before we can decide whether it's safe to backport to older branches. The decision will be reflected in changes to the "fixVersion" field. In the meantime I'm requesting a backport to v3.0 on your behalf.

Comment by Andrew de Quincey [ 26/Jan/16 ]

Will this fix be backported to the 3.0 series as well?

We're still on 2.6, so we need to upgrade to 3.0 and then on to 3.2. If this fix isn't backported, there's a window when we'll be running a pure 3.0 system in production at risk of these long lock times.

Comment by Alan Jackson [ 22/Jan/16 ]

Hi Ramon, thanks for getting back so quickly.
I understand that this is WT rather than mongo3+. We are on mongo2.6 and were looking at mongo3+WT for the performance improvements.
We have a high-volume (~2000/sec, 4-6k mongo ops/sec) web hit tracking infrastructure, so the write performance improvements in WT were the main appeal of moving - the use case is very simple queries, so the other 3.2 improvements weren't immediately relevant.
However, we dynamically create collections, and this lock was causing unacceptable delays (it also appears to occur on index creation). We also ran into SERVER-19768 on mongo3.2.
We've temporarily increased the infrastructure under 2.6 and are considering 3.0-MMap for the collection-level concurrency, but otherwise we'll have to wait.

Comment by Ramon Fernandez Marina [ 22/Jan/16 ]

ajax@tvsquared.com, this issue only affects the WiredTiger storage engine. If you're considering upgrading note you can still upgrade to MongoDB 3.2 and continue to use the MMAPv1 storage engine. This will allow you to use all the new features in 3.2 (document validation, partial indexes, aggregation improvements) and set the stage for a transition to the WiredTiger storage engine once this issue is resolved.

Comment by Alan Jackson [ 22/Jan/16 ]

This issue (and possibly related index creation locks that behave identically) is currently keeping us on mongodb 2.6.

Comment by Alexander Gorrod [ 21/Jan/16 ]

michael@aorato.com We are actively pursuing a fix for this problem. We have isolated and understand the cause of the problem. We have created a reproducing case inside WiredTiger (see WT-2342), and are working on a fix. Once the fix is complete and tested we will work on backporting it to 3.2.

Comment by Michael Dolinsky [ 21/Jan/16 ]

This is a critical fix for us, as we create and drop collections on an hourly basis, and due to the stall we could lose data that would otherwise need to be inserted.
We ask that this be prioritized as quickly as possible for the next minor release of 3.2.

Thank you.
