[SERVER-60335] Wait for all user operations to be killed before executing WT Rollback To Stable Created: 30/Sep/21  Updated: 13/Mar/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Josef Ahmad Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to WT-8429 Improve error handling when rollback_... Blocked
is related to SERVER-74809 Yield call into storage engine after ... Closed
is related to SERVER-58311 MongoDB calls WT RTS API while holdin... Closed
is related to SERVER-63989 Retry rollback_to_stable until all co... Closed
Assigned Teams:
Storage Execution
Sprint: Execution Team 2021-11-15, Execution Team 2021-11-29, Execution Team 2021-12-13
Participants:
Linked BF Score: 6

 Description   

SERVER-58311 surfaced a race condition whereby we call into WiredTiger Rollback To Stable (RTS) while some operations that were intended to be killed are found to be still running. This is not expected as WiredTiger usually requires RTS to have exclusive access to a dhandle; a running user operation can't guarantee that. This is observed when building the server with these non-default flags.

E  STORAGE  22435   [BackgroundSync] "WiredTiger error","attr":{"error":16,"message":"[1625743985:510865][6205:0x7f6a41a84700], file:collection-XXX.wt, txn rollback_to_stable: __rollback_to_stable_btree_apply, 1522: file:sizeStorer.wt: unable to open handle, error indicates handle is unavailable due to concurrent use: Device or resource busy"}

The problematic operation surviving the killOp seems to be a yielded oplog getmore that waits for more oplog inserts via CappedInsertNotifier. Stack at the time of failure:

#0  0x00007f3adfe33f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3ae16e6705 in __gthread_cond_timedwait (__cond=0x7f3aad53f7b0, __mutex=0x7f3ab54de0b0, __abs_timeout=0x7f3a8da22ce0) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../in
clude/c++/8.2.0/x86_64-mongodb-linux/bits/gthr-default.h:871
#2  0x00007f3ae17125e6 in std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/li
b/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:178
#3  0x00007f3ae1712535 in std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x
86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:106
#4  0x00007f3ade9d55fc in std::_V2::condition_variable_any::wait_until<std::unique_lock<mongo::latch_detail::Latch>, std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongod
btoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:286
#5  0x00007f3acac8a2bf in mongo::CappedInsertNotifier::waitUntil (this=0x7f3aad53f7b0, prevVersion=18, deadline=...) at src/mongo/db/catalog/collection.cpp:53
#6  0x00007f3ad436699a in mongo::insert_listener::waitForInserts(mongo::OperationContext*, mongo::PlanYieldPolicy*, mongo::insert_listener::CappedInsertNotifierData*)::$_1::operator()() const (this=0x7f3a8da230c0) at src/mongo/db/query/plan_insert_listener.cpp:108
#7  0x00007f3ad43667fd in std::_Function_handler<void (), mongo::insert_listener::waitForInserts(mongo::OperationContext*, mongo::PlanYieldPolicy*, mongo::insert_listener::CappedInsertNotifierData*)::$_1>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
#8  0x00007f3adf2c491e in std::function<void ()>::operator()() const (this=0x7f3a8da230c0) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
#9  0x00007f3ad084066e in mongo::PlanYieldPolicy::performYield(mongo::OperationContext*, mongo::Yieldable const*, std::function<void ()>) (this=0x7f3a9ee0e380, opCtx=0x7f3aacbbc220, yieldable=0x7f3a8da24a10, whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:180
#10 0x00007f3ad0840134 in mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) (this=0x7f3a9ee0e380, opCtx=0x7f3aacbbc220, whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:108
#11 0x00007f3ad4366450 in mongo::insert_listener::waitForInserts (opCtx=0x7f3aacbbc220, yieldPolicy=0x7f3a9ee0e380, notifierData=0x7f3a8da234b0) at src/mongo/db/query/plan_insert_listener.cpp:106
#12 0x00007f3ad433d3b8 in mongo::PlanExecutorImpl::_getNextImpl (this=0x7f3a94d37020, objOut=0x7f3a8da23578, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:451
#13 0x00007f3ad433c598 in mongo::PlanExecutorImpl::getNextDocument (this=0x7f3a94d37020, objOut=0x7f3a94d370f0, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:311
#14 0x00007f3ad433c37a in mongo::PlanExecutorImpl::getNext (this=0x7f3a94d37020, objOut=0x7f3a8da23900, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:298
#15 0x00007f3abf2e9b85 in mongo::(anonymous namespace)::GetMoreCmd::Invocation::generateBatch (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, cursor=0x7f3a94d2d620, cmd=..., isTailable=true, nextBatch=0x7f3a8da243d0, numResults=0x7f3a8da23ff0, docUnitsReturned=0x7f3a8da23fd8) at src/mongo/db/commands/getmore_cmd.cpp:335
#16 0x00007f3abf2e7cf2 in mongo::(anonymous namespace)::GetMoreCmd::Invocation::acquireLocksAndIterateCursor (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, reply=0x7f3ab5446ca0, cursorManager=0x7f3ab54053a0, cursorPin=..., curOp=0x7f3a94d1a7b8) at src/mongo/db/commands/getmore_cmd.cpp:641
#17 0x00007f3abf2e5f2f in mongo::(anonymous namespace)::GetMoreCmd::Invocation::run (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, reply=0x7f3ab5446ca0) at src/mongo/db/commands/getmore_cmd.cpp:745

Leads of investigation:



 Comments   
Comment by Connie Chen [ 13/Dec/21 ]

Moving this to "Open" and to the backlog as we do not think we should do this. Decided to disable the assertion instead. 

Comment by Donald Anderson [ 19/Oct/21 ]

It looks like BF-23021 can be explained by this issue.

Comment by Louis Williams [ 15/Oct/21 ]

SERVER-55173 is a related problem with shutdown. henrik.edin you might want to discuss with yuhong.zhang if there's a common solution.

Generated at Thu Feb 08 05:49:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.