[SERVER-22199] Collection drop command during checkpoint causes complete stall until end of checkpoint Created: 15/Jan/16  Updated: 18/Nov/16  Resolved: 02/Feb/16

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: None
Fix Version/s: 3.2.3, 3.3.2

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 1
Labels: WTplaybook, code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File drop-stall.png    
Issue Links:
Depends
depends on WT-2333 Add a flag so drop doesn't block Closed
depends on WT-2334 Add support for background drops Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:

 Description   

Here's an example:

  • Checkpoint from K to N
  • Drop command at L
  • This results in a complete stall from M to N (including the ftdc thread, which explains the lack of ftdc samples from M to N)

Stack traces collected during the stall show why: multiple operations are all stuck in dropAllQueued which in turn is blocked in __wt_session_drop:

00000066`976bf198 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
00000066`976bf1a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
00000066`976bf270 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
00000066`976bf2b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
00000066`976bf2e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
00000066`976bf340 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
00000066`976bf4c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
00000066`976bf520 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
00000066`976bf560 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
00000066`976bf590 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
00000066`976bf5d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
00000066`976bf600 00007ff7`ef812245 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
00000066`976bf640 00007ff7`ef2e57f7 mongod!mongo::MyMessageHandler::process+0xf5
00000066`976bf860 00007ff7`ef016bdd mongod!mongo::PortMessageServer::handleIncomingMsg+0x4a7

The FTDC thread is similarly stuck:

00000063`1823f498 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
00000063`1823f4a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
00000063`1823f570 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
00000063`1823f5b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
00000063`1823f5e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
00000063`1823f640 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
00000063`1823f7c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
00000063`1823f820 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
00000063`1823f860 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
00000063`1823f890 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
00000063`1823f8d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
00000063`1823f900 00007ff7`eee7a490 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
00000063`1823f940 00007ff7`eee7bf1d mongod!mongo::FTDCCollectorCollection::collect+0x300
00000063`1823fb80 00007ff7`ef016bdd mongod!mongo::FTDCController::doLoop+0x37d

I believe the sequence is:

  • at L the drop command attempts to drop the collection but gets an EBUSY from __wt_session_drop due to the checkpoint, so it queues the drop command for later execution
  • every second from L to M an attempt is made to drop the collection again but again gets an EBUSY from __wt_session_drop so it remains queued and operations continue
  • but starting at M __wt_session_drop no longer returns EBUSY but instead gets stuck waiting on the checkpoint, resulting in the stall

It looks like there are two problems:

  • due to the logic in dropAllQueued in the integration layer, every completing operation on all threads attempts to execute the queued drop; this is why the stall is a complete stall, instead of affecting only one thread
  • starting at M, during the critical phase of the check point (coinciding with the "pinned transactions" stat), WT blocks in __wt_session_drop instead of returning EBUSY


 Comments   
Comment by Michael Cahill (Inactive) [ 22/Feb/16 ]

michael@aorato.com, yes, this issue is fixed in 3.2.3. No changes are required to the application: drops during checkpoints will now return immediately without blocking. The disk space is reclaimed asynchronously.

Comment by Michael Dolinsky [ 21/Feb/16 ]

Just to make sure
This is fixed in 3.2.3?
If yes, what is fixed mean? we don't need to do anything and the issue will not appear or do we need to add a flag somewhere for this to be fixed?

Comment by Ramon Fernandez Marina [ 02/Feb/16 ]

That is correct michael@aorato.com. We're working on a 3.2.2-rc0 release candidate, scheduled for the coming days, and it would be useful if you could test it and confirm that the fix produces the desired results in your use case.

Comment by Michael Dolinsky [ 02/Feb/16 ]

Thanks.
so this is fixed due to the new flag which by default the drops are now happening in the background? (https://jira.mongodb.org/browse/WT-2333)

Comment by Githook User [ 02/Feb/16 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-22199 Don't wait for WT locks during drop.

(cherry picked from commit aa98ee551e0fcface0a1c9d63001c269c068d9ac)
Branch: v3.2
https://github.com/mongodb/mongo/commit/b7bb72efb1e3f96398f1a416de7a102f4eb17f82

Comment by Githook User [ 02/Feb/16 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-22199 Don't wait for WT locks during drop.
Branch: master
https://github.com/mongodb/mongo/commit/aa98ee551e0fcface0a1c9d63001c269c068d9ac

Comment by Ramon Fernandez Marina [ 21/Jan/16 ]

michael@aorato.com, a short-term fix (WT-2333) is already in code review and should be part of the next stable release. A more comprehensive approach is being investigated in WT-2334 for later releases. We'll post updates to this ticket as they become available.

Comment by Michael Dolinsky [ 21/Jan/16 ]

This is a critical fix for us as we are creating and dropping collections on an hourly basis and due to the stall we could lose data that would otherwise needs to be inserted.
We ask that this would be prioritized as fast as possible for the next minor release of 3.2.

Thank you.

Generated at Thu Feb 08 03:59:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.