Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-22199

Collection drop command during checkpoint causes complete stall until end of checkpoint

    XMLWordPrintable

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Completed:

      Description

      Here's an example:

      • Checkpoint from K to N
      • Drop command at L
      • This results in a complete stall from M to N (including the ftdc thread, which explains the lack of ftdc samples from M to N)

      Stack traces collected during the stall show why: multiple operations are all stuck in dropAllQueued which in turn is blocked in __wt_session_drop:

      00000066`976bf198 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
      00000066`976bf1a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
      00000066`976bf270 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
      00000066`976bf2b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
      00000066`976bf2e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
      00000066`976bf340 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
      00000066`976bf4c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
      00000066`976bf520 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
      00000066`976bf560 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
      00000066`976bf590 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
      00000066`976bf5d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
      00000066`976bf600 00007ff7`ef812245 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
      00000066`976bf640 00007ff7`ef2e57f7 mongod!mongo::MyMessageHandler::process+0xf5
      00000066`976bf860 00007ff7`ef016bdd mongod!mongo::PortMessageServer::handleIncomingMsg+0x4a7
      

      The FTDC thread is similarly stuck:

      00000063`1823f498 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
      00000063`1823f4a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
      00000063`1823f570 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
      00000063`1823f5b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
      00000063`1823f5e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
      00000063`1823f640 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
      00000063`1823f7c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
      00000063`1823f820 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
      00000063`1823f860 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
      00000063`1823f890 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
      00000063`1823f8d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
      00000063`1823f900 00007ff7`eee7a490 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
      00000063`1823f940 00007ff7`eee7bf1d mongod!mongo::FTDCCollectorCollection::collect+0x300
      00000063`1823fb80 00007ff7`ef016bdd mongod!mongo::FTDCController::doLoop+0x37d
      

      I believe the sequence is:

      • at L the drop command attempts to drop the collection but gets an EBUSY from __wt_session_drop due to the checkpoint, so it queues the drop command for later execution
      • every second from L to M an attempt is made to drop the collection again but again gets an EBUSY from __wt_session_drop so it remains queued and operations continue
      • but starting at M __wt_session_drop no longer returns EBUSY but instead gets stuck waiting on the checkpoint, resulting in the stall

      It looks like there are two problems:

      • due to the logic in dropAllQueued in the integration layer, every completing operation on all threads attempts to execute the queued drop; this is why the stall is a complete stall, instead of affecting only one thread
      • starting at M, during the critical phase of the check point (coinciding with the "pinned transactions" stat), WT blocks in __wt_session_drop instead of returning EBUSY

        Attachments

        1. drop-stall.png
          105 kB
          Bruce Lucas

          Issue Links

            Activity

              People

              • Votes:
                1 Vote for this issue
                Watchers:
                18 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: