[SERVER-31101] WT table not dropped after collection is dropped due to long-running OperationContext Created: 15/Sep/17  Updated: 30/Oct/23  Resolved: 16/Nov/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.2.13, 3.4.9
Fix Version/s: 3.2.18, 3.4.11, 3.6.0-rc1

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Eric Milkie
Resolution: Fixed Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File drops_actually_remove_wt_files.js    
Issue Links:
Backports
Duplicate
duplicates SERVER-32082 dropDatabase reports success but db i... Closed
is duplicated by SERVER-31573 WT table not dropped on primary after... Closed
is duplicated by SERVER-33108 Collection deleted - file still exists Closed
is duplicated by SERVER-34553 dropCollection does not free up stora... Closed
Related
related to SERVER-30426 dropDatabase very slow due to repeate... Closed
related to SERVER-31573 WT table not dropped on primary after... Closed
related to SERVER-32030 SyncTail::oplogApplication uses a lon... Closed
is related to SERVER-26870 Sometimes collection data file is not... Closed
is related to SERVER-31587 ReplBatcher has a long-running Operat... Closed
is related to SERVER-31589 A primary's SyncSourceFeedback only r... Closed
is related to SERVER-31590 WT yieldAndAwaitOplogDeletionRequest ... Closed
is related to SERVER-31591 HMAC key refresh thread holds onto an... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:
Case:

 Description   

In a replica set

  • create a collection by inserting a document
  • restart the replica set
  • then drop the collection or database

On the primary the corresponding WT tables will be dropped. On the secondaries, however, the index table will be dropped but not the collection table: the .wt file will remain, and the FTDC data will show a steady stream of 10 failed table drops per second.



 Comments   
Comment by Githook User [ 17/Nov/17 ]

Author:

{'name': 'Katherine Walker', 'username': 'kvwalker', 'email': 'katherine.walker@mongodb.com'}

Message: SERVER-31101 Drop WT table on secondary after collection is dropped

(cherry picked from commit 0f9d6085526921dd8edaffce827ffd07653ee4cd)
Branch: v3.2
https://github.com/mongodb/mongo/commit/85eec1d61a2bc6a7b7a1391fd787ad00eadc5060

Comment by Githook User [ 17/Nov/17 ]

Author:

{'name': 'Katherine Walker', 'username': 'kvwalker', 'email': 'katherine.walker@mongodb.com'}

Message: SERVER-31101 Drop WT table on secondary after collection is dropped

(cherry picked from commit 0f9d6085526921dd8edaffce827ffd07653ee4cd)
Branch: v3.4
https://github.com/mongodb/mongo/commit/d35752b5a2049f1067b325952d4ea39302521bea

Comment by Eric Milkie [ 16/Nov/17 ]

At this moment the fixes are in the 3.6 and master branches, but not yet in the 3.4 or 3.2 branches.

Comment by Eric Milkie [ 07/Nov/17 ]

The work for the subtickets has been completed and pushed into the master branch in preparation for the 3.6 release. The 3.4 release will be fixed as soon as SERVER-31587 and SERVER-31589 are backported there. I will resolve this ticket at that time.

Comment by Eric Milkie [ 18/Oct/17 ]

It turns out my idea wasn't so easy, as our code assumes you can hold cursors outside of a transaction and return them to the cache later on.

Comment by Eric Milkie [ 18/Oct/17 ]

My POC for this will be to return the RecoveryUnit's session back to the cache when a transaction finishes. I believe that this will actually fix the problem described in the ticket with minimal code changes, and the performance repercussions will be minimal. It would only affect operations that perform many transactions using the same OperationContext, and I can't think of any situation where this happens often enough to be noticeable.
We'd also need to root out read transactions that were not properly closed with abandonSnapshot(), but the list of such places will be much shorter than the list of long-running OperationContexts.
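
A minimal sketch of that idea, using the WiredTigerRecoveryUnit members visible in Daniel's patch further down this page; the helper itself is hypothetical and not the actual change:

// Hypothetical sketch only: hand the WT session back to the cache as soon as a
// transaction finishes, instead of keeping it for the OperationContext's lifetime.
void WiredTigerRecoveryUnit::_releaseSessionAfterTxn() {
    invariant(!_inUnitOfWork);  // only outside an active write unit of work
    if (_session) {
        // The UniqueWiredTigerSession deleter returns the session (and its cached
        // cursors) to the WiredTigerSessionCache, where a pending table drop can
        // close any cursors still referencing the dropped table.
        _session.reset();
    }
}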

Comment by Daniel Gottlieb (Inactive) [ 18/Oct/17 ]

Moving back to 3.6 Required (can be downgraded to 3.5 Desired if that's the decision).

It was only moved to RC1 when the ticket was (temporarily) resolved.

Comment by Michael Cahill (Inactive) [ 18/Oct/17 ]

First up, I don't believe this issue is an RC1 blocker. It exists in earlier versions of MongoDB, and I don't see any evidence that 3.6 work has made it worse or that anything else in 3.6 relies on this being resolved.

Like daniel.gottlieb and bruce.lucas, I have been thinking about whether we can disable caching to avoid these issues. I like the idea above of flagging long-running operations and treating them differently. donald.anderson has also been working on improving the performance of opening WiredTiger cursors with the goal of disabling cursor caching in the MongoDB layer entirely.

I ran some tests yesterday of disabling cursor caching between operations. This is orthogonal to Dan's approach above. With my changes, a cursor opened by a long-running operation would still block a drop until the operation releases the session, but a session no longer comes out of the cache with baggage that would block unrelated drops.

Here are patch builds of microbenchmarks and sys-perf. There is a lot of red, but looking at the detail, many workloads slowed by 5-10%, and only a small set of tight loops (such as inserting empty documents) saw an impact of 20% or more.

I suspect that if we provided a parameter to control this, merging Dan's approach with this one, some customers would be prepared to pay ~10% in performance to avoid the stalls and uncertainty around drops.
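
A rough sketch of what such a parameter might look like; the knob name and the surrounding code are assumptions, not the patch that was actually benchmarked:

// Hypothetical knob; not an existing server parameter.
std::atomic<bool> gWiredTigerCursorCacheEnabled{true};

void WiredTigerSession::releaseCursor(uint64_t id, WT_CURSOR* cursor) {
    if (!gWiredTigerCursorCacheEnabled.load()) {
        // With caching disabled, close the cursor immediately so it can never
        // pin a table that a later drop needs to remove.
        invariantWTOK(cursor->close(cursor));
        return;
    }
    // ... existing path: put the cursor back into this session's cursor cache ...
}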

Comment by Daniel Gottlieb (Inactive) [ 17/Oct/17 ]

I put together this patch as a POC for opting out of using cached sessions. There are still ways a client of an OperationContext can get into an undesirable state; for example, this patch still uses cursor caching on the long-running operation context. If a long-running operation context first accesses some (user) tables before blocking, those tables would not be droppable until the session is destroyed when it goes out of scope.

I don't believe any of the recently identified long-running operation contexts fall into this category. One (of the few?) examples of an internal thread that accesses user collections is the TTL thread. However, in that case the OperationContext is only kept for a single pass and is not held across any sleeping (though it will block taking DB/Collection locks).

diff --git a/src/mongo/db/keys_collection_manager_sharding.cpp b/src/mongo/db/keys_collection_manager_sharding.cpp
index 265dbf9e51..751a44bda2 100644
--- a/src/mongo/db/keys_collection_manager_sharding.cpp
+++ b/src/mongo/db/keys_collection_manager_sharding.cpp
@@ -217,6 +217,7 @@ void KeysCollectionManagerSharding::PeriodicRunner::_doPeriodicRefresh(
 
     while (true) {
         auto opCtx = cc().makeOperationContext();
+        opCtx->recoveryUnit()->forLongRunningOperation();
 
         bool hasRefreshRequestInitially = false;
         unsigned errorCount = 0;
diff --git a/src/mongo/db/repl/sync_source_feedback.cpp b/src/mongo/db/repl/sync_source_feedback.cpp
index dceeb7422d..03e475c5e5 100644
--- a/src/mongo/db/repl/sync_source_feedback.cpp
+++ b/src/mongo/db/repl/sync_source_feedback.cpp
@@ -151,6 +152,7 @@ void SyncSourceFeedback::run(executor::TaskExecutor* executor, BackgroundSync* b
 
     while (true) {  // breaks once _shutdownSignaled is true
         auto opCtx = cc().makeOperationContext();
+        opCtx->recoveryUnit()->forLongRunningOperation();
 
         if (keepAliveInterval == Milliseconds(0)) {
             keepAliveInterval = calculateKeepAliveInterval(opCtx.get(), _mtx);
diff --git a/src/mongo/db/repl/sync_tail.cpp b/src/mongo/db/repl/sync_tail.cpp
index 15185abaf7..959600d568 100644
--- a/src/mongo/db/repl/sync_tail.cpp
+++ b/src/mongo/db/repl/sync_tail.cpp
@@ -750,6 +750,8 @@ private:
     void run() {
         Client::initThread("ReplBatcher");
         const ServiceContext::UniqueOperationContext opCtxPtr = cc().makeOperationContext();
+        opCtxPtr->recoveryUnit()->forLongRunningOperation();
+
         OperationContext& opCtx = *opCtxPtr;
         const auto replCoord = ReplicationCoordinator::get(&opCtx);
         const auto fastClockSource = opCtx.getServiceContext()->getFastClockSource();
diff --git a/src/mongo/db/storage/recovery_unit.h b/src/mongo/db/storage/recovery_unit.h
index fe9ed47316..2b59c4310c 100644
--- a/src/mongo/db/storage/recovery_unit.h
+++ b/src/mongo/db/storage/recovery_unit.h
@@ -70,6 +70,10 @@ public:
     virtual void commitUnitOfWork() = 0;
     virtual void abortUnitOfWork() = 0;
 
+    virtual void forLongRunningOperation() {
+        return;
+    }
+
     /**
      * Waits until all commits that happened before this call are durable. Returns true, unless the
      * storage engine cannot guarantee durability, which should never happen when isDurable()
diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store_mongod.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store_mongod.cpp
index bb0b643cc4..802454ec3b 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store_mongod.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store_mongod.cpp
@@ -78,6 +78,7 @@ public:
         }
 
         const ServiceContext::UniqueOperationContext opCtxPtr = cc().makeOperationContext();
+        opCtxPtr->recoveryUnit()->forLongRunningOperation();
         OperationContext& opCtx = *opCtxPtr;
 
         try {
diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
index 331b2e97d5..17ae8d50e3 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp
@@ -128,6 +128,12 @@ void WiredTigerRecoveryUnit::abortUnitOfWork() {
     _abort();
 }
 
+void WiredTigerRecoveryUnit::forLongRunningOperation() {
+    invariant(!_inUnitOfWork);
+    invariant(!_session);
+    _session = UniqueWiredTigerSession(new WiredTigerSession(_sessionCache->conn()));
+}
+
 void WiredTigerRecoveryUnit::_ensureSession() {
     if (!_session) {
         _session = _sessionCache->getSession();
diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.h b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.h
index c49adb9d1a..dea8dd238d 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.h
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.h
@@ -56,6 +56,8 @@ public:
     void commitUnitOfWork() override;
     void abortUnitOfWork() override;
 
+    void forLongRunningOperation() override;
+
     bool waitUntilDurable() override;
 
     void registerChange(Change* change) override;
diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp
index fdffe30c5e..2eac3a376c 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp
@@ -51,6 +51,7 @@ namespace mongo {
 WiredTigerSession::WiredTigerSession(WT_CONNECTION* conn, uint64_t epoch, uint64_t cursorEpoch)
     : _epoch(epoch),
       _cursorEpoch(cursorEpoch),
+      _cache(nullptr),
       _session(NULL),
       _cursorGen(0),
       _cursorsCached(0),
@@ -387,7 +388,11 @@ void WiredTigerSessionCache::setJournalListener(JournalListener* jl) {
 
 void WiredTigerSessionCache::WiredTigerSessionDeleter::operator()(
     WiredTigerSession* session) const {
-    session->_cache->releaseSession(session);
+    if (session->_cache) {
+        session->_cache->releaseSession(session);
+    } else {
+        delete session;
+    }
 }
 
 }  // namespace mongo

Comment by Daniel Gottlieb (Inactive) [ 17/Oct/17 ]

bruce.lucas I like the idea of letting applications opt out of the WT session cache. Much of the machinery is already in place. The session cache is presumably most useful for client cursors, and we haven't identified those as problematic.

I believe a method on the RecoveryUnit API and its resulting small implementation for the WTRecoveryUnit would give us an alternative one-line fix for each of these long-running OperationContexts.

Also looping in michael.cahill.

Comment by Bruce Lucas (Inactive) [ 17/Oct/17 ]

Looking at the linked tickets required to get rid of long-running OperationContexts, it looks to me like it's going to be messy to fix some of them, and it may be hard to avoid creating more.

I think the issue on this ticket could be avoided if long-running OperationContexts obtained fresh sessions rather than using cached sessions, or in some way reset the session they got from the cache. That would avoid holding a session whose cached WT cursor prevents the WT table from being dropped. Adding such an option to support long-running OperationContexts seems fairly straightforward (even if not completely trivial). Should we consider such a fix as an alternative to stamping out all current and future long-running OperationContexts?

cc: schwerin, milkie, daniel.gottlieb.

Comment by Bruce Lucas (Inactive) [ 17/Oct/17 ]

The HMAC thread mentioned above has been positively identified on SERVER-31573 as problematic.

Comment by Daniel Gottlieb (Inactive) [ 16/Oct/17 ]

Yep. I know geert.bosch and milkie wanted to have OperationContexts timed. I don't remember if a ticket was made for it.

Comment by Bruce Lucas (Inactive) [ 16/Oct/17 ]

If part of the contract of operation contexts is that they shouldn't run indefinitely, should that condition be detected - as a start, maybe when an operation context is destroyed after it has been running for more than some reasonable time? And reported, maybe with an abort (and test failure) during testing, and with a message in the log in production?
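
A rough sketch of that suggestion, assuming a hypothetical _creationTime member and an arbitrary 10-minute threshold (both illustrative, not existing code):

// Hypothetical detection in the destructor; the threshold is a placeholder.
OperationContext::~OperationContext() {
    const auto lifetime =
        getServiceContext()->getFastClockSource()->now() - _creationTime;
    if (lifetime > Minutes(10)) {
        // During testing one could fassert here so the offending thread is found;
        // in production, just leave a trail in the log.
        warning() << "long-lived OperationContext destroyed after "
                  << durationCount<Seconds>(lifetime) << "s";
    }
}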

Comment by Daniel Gottlieb (Inactive) [ 16/Oct/17 ]

And assigning back to katherine.walker for credit!

Comment by Daniel Gottlieb (Inactive) [ 16/Oct/17 ]

Resolving this ticket, as the fix addressed the first cause the reproducers were exposing: a table drop failing due to an open cursor.

Creating follow-up tickets for the additional long-running operation contexts, and will link them here.

Comment by Daniel Gottlieb (Inactive) [ 14/Oct/17 ]

Yes, all of those should be addressed.

Comment by Bruce Lucas (Inactive) [ 14/Oct/17 ]

daniel.gottlieb, by adding some logging in the OperationContext constructor and destructor to track OperationContext lifetime, I identified the following threads that appear to create long-lasting OperationContexts - created but never destroyed, even after ~5 minutes of uptime:

2017-10-14T11:00:27.503-0400 I -        [ReplBatcher] xxx OperationContext 0x5596ba2c5280
2017-10-14T11:00:28.361-0400 I -        [WT RecordStoreThread: local.oplog.rs] xxx OperationContext 0x5596ba2c67c0
2017-10-14T11:00:37.603-0400 I -        [SyncSourceFeedback] xxx OperationContext 0x5596b90d3400
2017-10-14T11:00:39.848-0400 I -        [monitoring keys for HMAC] xxx OperationContext 0x5596b9001540

One of them is ReplBatcher as you identified. If long-lasting OperationContexts can in general cause this problem, do these all need to be addressed, regardless of which one happens to be responsible for this particular repro?
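
The instrumentation was presumably along these lines (a sketch of the kind of constructor/destructor logging described above, not the actual diagnostic patch; the constructor signature is assumed):

// Assumed diagnostic-only logging; log lines match the format quoted above.
OperationContext::OperationContext(Client* client, unsigned int opId)
    : _client(client), _opId(opId) {
    log() << "xxx OperationContext " << static_cast<void*>(this);
}

OperationContext::~OperationContext() {
    log() << "xxx ~OperationContext " << static_cast<void*>(this);
}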

Comment by Bruce Lucas (Inactive) [ 14/Oct/17 ]

I opened SERVER-31573 to track an issue with the table not being dropped on the primary; its relationship to this issue and to SERVER-26870 is TBD. There is a new repro script on SERVER-31573 that also reproduces the issue on the secondaries and detects it a little more robustly than the script on this ticket.

Comment by Daniel Gottlieb (Inactive) [ 13/Oct/17 ]

This OperationContext from ReplBatcher seems to be the new culprit: https://github.com/mongodb/mongo/blob/8210549ab754524b155ea45eb6675d717873cb21/src/mongo/db/repl/sync_tail.cpp#L753

This one is a bit too complicated for me to POC hack, but it's pretty obvious that's a long-lasting OperationContext.

Comment by Daniel Gottlieb (Inactive) [ 13/Oct/17 ]

I gave spencer more of the details of what this fix does (explained the session and cursor caching), and how, in general, a fix might not result in the test passing. What I've done so far:

  • Reproduced that on master, prior to katherine.walker's patch, I was getting a failure.
  • Reproduced that after Katherine's patch, I continued to get an error.
  • Reproduced that my original dirty POC (deleting the WTRecoveryUnit from the opCtx that was being held onto, returning their sessions/cursors) results in the test passing.

My next step is to apply my "debug" patch that tracks sessions/cursors to see if the committed patch is simply unlucky and whether there's another abusive consumer of OperationContexts.

FWIW: I believe Katherine's patch is an incremental improvement, even if it didn't catch this repro.

Comment by Spencer Brody (Inactive) [ 13/Oct/17 ]

So with Katherine's patch in we should no longer be waiting on the OplogFetcher with an OperationContext still in scope. So I think we need to re-visit the root cause analysis. Handing back to daniel.gottlieb for further investigation.

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

Yes, I'm still seeing failed table drops on the secondary.

What's more, I'm also seeing failed table drops now on the primary, like with SERVER-26870, but I see that in 3.6.0-rc0 as well, so I'll open a separate ticket for that.

Comment by Spencer Brody (Inactive) [ 13/Oct/17 ]

bruce.lucas, Katherine just pushed a change that limits the lifetime of the OperationContext used by the rsBackgroundSync thread. If the root cause of this issue is indeed what daniel.gottlieb believed it to be, that should fix the issue. In Katherine's local testing based on Geert's script, however, the issue still seems to show up. Can you run your repro on master and see if it still triggers? If so, we might have to re-evaluate our root cause analysis.

Comment by Githook User [ 13/Oct/17 ]

Author:

{'email': 'katherine.walker@mongodb.com', 'name': 'Katherine Walker', 'username': 'kvwalker'}

Message: SERVER-31101 Drop WT table on secondary after collection is dropped

This reverts commit 285ea409456815f5d36b07e301328e136d04374e.
Branch: master
https://github.com/mongodb/mongo/commit/0f9d6085526921dd8edaffce827ffd07653ee4cd

Comment by Bruce Lucas (Inactive) [ 06/Oct/17 ]

spencer, that script was geert.bosch's, but I think you are correct.

I also separately provided a shell script.

Comment by Spencer Brody (Inactive) [ 06/Oct/17 ]

bruce.lucas, looking at the test you uploaded:

    jsTestLog('Wait for wiredTiger file ' + wtFileOnPrimary + ' to disappear from primary');
    assert.soon(
        () => pathExists(wtFileOnPrimary),
        'expected writeTiger collection file ' + wtFileOnPrimary + ' to disappear from primary',
        30 * 1000);

^ shouldn't that be "() => !pathExists(wtFileOnPrimary)," (checking that the path does not exist)?

Comment by Spencer Brody (Inactive) [ 26/Sep/17 ]

The problem is that bgsync keeps an OperationContext around for a long time, and across network i/o. The OperationContext should be destroyed when it is not in use, and a new one created later if needed.
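
An illustrative sketch of the intended pattern (not the committed fix; doLocalReads, fetchRemoteOplog, and handleBatch are hypothetical stand-ins for the bgsync steps):

// Keep the OperationContext scoped to the storage work, so its WT session and
// cached cursors are released before the thread blocks on network i/o.
void backgroundSyncIteration() {
    {
        auto opCtx = cc().makeOperationContext();
        doLocalReads(opCtx.get());      // storage access needs an OperationContext
    }   // opCtx destroyed here; its session goes back to the WT session cache

    auto batch = fetchRemoteOplog();    // network i/o, no OperationContext held

    {
        auto opCtx = cc().makeOperationContext();   // fresh context when needed again
        handleBatch(opCtx.get(), batch);
    }
}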

Comment by Daniel Gottlieb (Inactive) [ 25/Sep/17 ]

My original assessment of the rsBackgroundSync thread was half-wrong, but it put us in the right direction. Ultimately the problem is that the rsBackgroundSync thread, which spawns and joins the OplogFetcher thread, does so while holding onto an OperationContext. This keeps the WT_SESSION from being returned to the cache and having its cursors closed via the usual mechanisms. https://github.com/mongodb/mongo/blob/fbea0197073ec8e8fb4f9588b4d3f01fd80514ca/src/mongo/db/repl/bgsync.cpp#L496

Comment by Daniel Gottlieb (Inactive) [ 25/Sep/17 ]

geert.bosch's suspicion was right. There was code for closing cursors from a previous epoch. In fact, there still is code to do so, but it became more sophisticated with https://github.com/mongodb/mongo/commit/b9739a9b4bf02790d427a8e4910629f8c378c5fb

Currently investigating if the patch unintentionally broke something.

Comment by Daniel Gottlieb (Inactive) [ 25/Sep/17 ]

In the meantime, I'm going to investigate a question geert.bosch had. Geert was under the impression that a failed table drop would increment the WiredTigerSessionCache epoch. The effect would be that sessions returned to the cache from a previous epoch are closed instead of being cached.

I'm going to look and see if that code ever existed and if it accidentally got removed with some other change.
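
The mechanism Geert describes would look roughly like this (an assumed sketch; member and method names are approximate, not a quote of any particular revision):

// Assumed shape: a failed drop bumps the cache epoch, and a session from an
// older epoch is closed on release rather than returned to the cache.
void WiredTigerSessionCache::releaseSession(WiredTigerSession* session) {
    if (session->_getEpoch() < _epoch.load()) {
        // Closing the session also closes its cached cursors, so the retried
        // drop can succeed.
        delete session;   // the WiredTigerSession destructor closes the WT_SESSION
        return;
    }
    // ... otherwise return the session to the cache as usual ...
}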

Comment by Daniel Gottlieb (Inactive) [ 25/Sep/17 ]

With the repro geert.bosch provided me, the rsBackgroundSync thread seems to be in a loop that:

  1. Gets a session from the cache
  2. Does something followed by a rollback (doing reads?)
  3. Logs a line, for example:
     [js_test:server31101] 2017-09-25T11:06:44.172-0400 d20011| 2017-09-25T11:06:44.172-0400 I REPL [rsBackgroundSync] sync source candidate: chimichurri:20010
     or
     [js_test:server31101] 2017-09-25T11:06:44.177-0400 d20011| 2017-09-25T11:06:44.177-0400 I REPL [rsBackgroundSync] scheduling fetcher to read remote oplog on chimichurri:20010 starting at filter: Timestamp(1506351978000, 2)
  4. Sleeps for 1 second
  5. Releases the session

Because the session seems to be released after the sleep, the following iteration of the loop immediately gets the same session back. Dropping a table attempts to close existing cursors, but it can only do so on cached sessions. Thus the drop is highly unlikely to get in between iterations of the loop.

spencer Given the description of what I'm seeing, do you know off the top of your head what the rsBackgroundSync is doing that would cause this?
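
For context, the cached-session limitation mentioned above looks roughly like this (an assumed sketch of the drop path; member names may not match the real code):

// Assumed sketch: when dropping a table, the cache can close cursors only in
// sessions that are currently sitting in the cache; a session checked out by a
// sleeping thread is unreachable, so its cursors keep the table pinned.
void WiredTigerSessionCache::closeAllCursors(const std::string& uri) {
    stdx::lock_guard<stdx::mutex> lock(_cacheLock);
    for (WiredTigerSession* cached : _sessions) {
        cached->closeAllCursors(uri);
    }
    // Sessions held by long-running OperationContexts are not in _sessions, so
    // the subsequent WT_SESSION::drop returns EBUSY and is retried.
}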

Comment by Bruce Lucas (Inactive) [ 19/Sep/17 ]

Here's a Linux shell script that seems to reproduce it reliably for me.

function repro {
 
    db=/ssd/db # change this as required
    uri='mongodb://localhost:27017/test?replicaSet=rs'
 
    function clean {
        killall -9 -w mongod
        rm -rf $db
    }
        
    function start {
        for i in 0 1; do 
            mkdir -p $db/r$i
            mongod --dbpath $db/r$i --logpath ./r$i.log --port 27${i}17 --replSet rs --fork
        done
    }
 
    function stop {
        killall -w mongod
    }
 
    function initiate {
        mongo --eval '
            rs.initiate({
                _id: "rs",
                 members: [
                     {_id: 0, host: "localhost:27017"},
                     {_id: 1, host: "localhost:27117"}
                ]
            })
        '
    }
 
    function check {
        sleep 2
        echo CHECKING $*
        for i in 0 1; do
            for w in collection index; do
                echo r$i $w $(ls -l $db/r$i | grep $w | wc -l)
            done
        done
    }
 
    clean; start; initiate
    mongo $uri --eval 'db.c.insert({})'
    check "after insert"
    stop
    check "after stop"
 
    start
    mongo $uri --eval 'print("drop:", db.c.drop())'
    check "after drop: expect r0 != r1"
}

Comment by Bruce Lucas (Inactive) [ 15/Sep/17 ]

This presumably has a similar cause to SERVER-26870.
