|
Author:
{'username': 'sulabhM', 'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com'}
Message: SERVER-35473 Fix writeconflict issues with cursor restarts
(cherry picked from commit b807f350638757c8833d82fec14a90d7b3a03051)
Branch: v4.0
https://github.com/mongodb/mongo/commit/a37441baa3d5d3dc6f70c7b0d7825f9dbbc90b3f
|
|
The final fix involved the following:
- Index and collection cursor restarts were not always starting a new transaction. The change ensures that, if the session is not already in a transaction, a read transaction is started on cursor restart.
- Cursor restarts via save() / restore() were calling SQLite reset, which is not sufficient to give up the read lock on the snapshot being operated on. The change finalizes the SQLite statement when saving the cursor state and re-prepares it on restore, which releases the old snapshot and acquires a new one when the query is restarted (see the sketch below).
- MapReduce in its finalReduce stage, after running a query into the storage engine through the planner, was not destroying the planner context and hence not calling save/restore on the underlying cursors. The change destroys that context once it is no longer needed.
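In outline, the save/restore side of the change looks roughly like the following. This is a simplified, illustrative sketch of the idea only, not the actual patch; the class and member names (CollectionCursorSketch, _txnActive, etc.) and the hard-coded table name are placeholders:

// Illustrative sketch only: the shape of the cursor save/restore change described
// above. All names are placeholders, not the real mobile storage engine symbols.
#include <sqlite3.h>
#include <cstdint>

class CollectionCursorSketch {
public:
    CollectionCursorSketch(sqlite3* db, bool* txnActive) : _db(db), _txnActive(txnActive) {}

    void save() {
        _savedId = _lastSeenId;       // remember where the cursor is positioned
        if (_stmt) {
            sqlite3_finalize(_stmt);  // reset() alone did not give up the read lock on
            _stmt = nullptr;          // the old snapshot, so tear the statement down
        }
    }

    void restore() {
        // If not already inside a transaction, start a read transaction so the
        // re-prepared statement operates on a fresh snapshot.
        if (!*_txnActive) {
            sqlite3_exec(_db, "BEGIN;", nullptr, nullptr, nullptr);
            *_txnActive = true;
        }
        // Re-prepare the cursor query and reposition it just past the saved record.
        const char* query =
            "SELECT rec_id, data FROM \"collection\" WHERE rec_id > ? ORDER BY rec_id ASC;";
        sqlite3_prepare_v2(_db, query, -1, &_stmt, nullptr);
        sqlite3_bind_int64(_stmt, 1, _savedId);
    }

    bool next() {
        if (sqlite3_step(_stmt) != SQLITE_ROW)
            return false;
        _lastSeenId = sqlite3_column_int64(_stmt, 0);
        return true;
    }

private:
    sqlite3* _db;
    bool* _txnActive;                 // stands in for the session's transaction state
    sqlite3_stmt* _stmt = nullptr;
    int64_t _lastSeenId = 0;
    int64_t _savedId = 0;
};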
|
|
Author:
{'username': 'sulabhM', 'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com'}
Message: SERVER-35473 Fix writeconflict issues with cursor restarts
Branch: master
https://github.com/mongodb/mongo/commit/b807f350638757c8833d82fec14a90d7b3a03051
|
Do you suspect the problem lies in the way that SqliteStatement is using prepared statements to implement index scans?
milkie, that is partly the reason.
I have figured out that the following is happening with SQLite, which was somewhat unexpected:
- A cursor read operation (i.e. preparing and stepping the SELECT statements) takes a read lock on an implicit snapshot.
- The lifetime of this snapshot and of the read lock is not tied to the transaction lifetime.
- I think this is because BEGIN does not really do anything by itself; it is the statement prepare/step that takes this read lock. Similarly, a ROLLBACK does not release the lock or the snapshot.
- We do not see this problem with WiredTiger because WT resets all of its cursors on rollback, which is not the case for SQLite. With a cursor still running, the BEGIN issued when the query restarts does not get a new snapshot; it continues on the older snapshot and keeps it read-locked throughout the yield (a small standalone sketch of this snapshot behaviour follows below).
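For what it's worth, this snapshot-per-statement behaviour reproduces with plain SQLite outside the engine. Below is a small standalone sketch (file and table names made up; it does not mimic the engine's exact BEGIN/ROLLBACK sequence, it only shows that an open SELECT pins a WAL snapshot for as long as that statement execution is alive):

// Standalone sketch: in WAL mode, a SELECT that has been stepped keeps reading its
// original snapshot; a concurrent writer's committed row stays invisible to that
// open statement. Not storage-engine code; names are made up.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* writer;
    sqlite3* reader;
    sqlite3_open("snapshot_demo.db", &writer);
    sqlite3_exec(writer,
                 "PRAGMA journal_mode=WAL;"
                 "DROP TABLE IF EXISTS t;"
                 "CREATE TABLE t(id INTEGER PRIMARY KEY);"
                 "INSERT INTO t VALUES (1),(2);",
                 nullptr, nullptr, nullptr);
    sqlite3_open("snapshot_demo.db", &reader);

    // The first step starts an implicit read transaction and pins the current snapshot.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(reader, "SELECT id FROM t ORDER BY id;", -1, &stmt, nullptr);
    sqlite3_step(stmt);  // row with id=1

    // A second connection commits a new row while the reader's statement is still open.
    sqlite3_exec(writer, "INSERT INTO t VALUES (3);", nullptr, nullptr, nullptr);

    // Continuing to step the open statement stays on the old snapshot: id=3 never appears.
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("open statement sees id=%lld\n",
                    (long long)sqlite3_column_int64(stmt, 0));  // prints only id=2
    }

    // After finalizing and preparing a fresh statement, the new row is visible.
    sqlite3_finalize(stmt);
    sqlite3_prepare_v2(reader, "SELECT count(*) FROM t;", -1, &stmt, nullptr);
    sqlite3_step(stmt);
    std::printf("fresh statement sees %lld rows\n",
                (long long)sqlite3_column_int64(stmt, 0));  // 3
    sqlite3_finalize(stmt);
    sqlite3_close(reader);
    sqlite3_close(writer);
    return 0;
}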
While root-causing this bug, I also noticed that with WiredTiger, when the global lock is given up and abandonSnapshot() gets called, some of the cursor operations on the collections are not saved/restored. I don't see how that can work, because a rollback will reset the cursors in WiredTiger and the subsequent next should then fail or return the wrong record! Likely WiredTiger is doing the right thing and I haven't figured all of it out yet, but this ticket doesn't need to wait for that.
I tested the following out and it worked:
- All cursors (index and collection) need to finalize (sqlite3_finalize()) their respective statements on save (or restore). Maybe a statement reset (sqlite3_reset()) would also do the job of giving up the read lock on the snapshot? I need to check the reset call.
- Save continues to keep track of where the cursor position is.
- Restore needs to re-create the sqlite statement and position it at the saved location.
- Restore also needs to re-start a read transaction using a BEGIN statement if not already in a transaction.
- The following step shouldn't be needed as long as the higher layers make sure that cursor save/restore is called on every yield and whenever reads are restarted across a transaction: maintain a list of all cursors open in a session/transaction; on closing the transaction, save the cursor state and call finalize (or reset?) on the SQLite statements associated with those cursors; on starting the transaction again, recreate the statements and restore their state.
Update: sqlite3_reset() on the statement does not give up that read lock on the snapshot.
|
|
milkie, I don't have an answer yet; I am working on it and will get back to you soon.
|
|
I understand your description up until this point:
At this point the cursor query starts again, no read transaction is started this time and the query holds a read lock on the snapshot it was operating on
How can the cursor be advanced in the query if no read transaction is active? To me, the read lock protects the existence of an open transaction, and the transaction has the snapshot. If there is no transaction, there is no snapshot.
Do you suspect the problem lies in the way that SqliteStatement is using prepared statements to implement index scans?
|
|
For reference, here is the gdb trace when the cursor query is restarted and stepped again after abandonSnapshot() has rolled back the read transaction (I added an invariant to catch this):
(gdb) bt
|
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
|
#1 0x00007f87710baf5d in __GI_abort () at abort.c:90
|
#2 0x00005581161cec3e in (anonymous namespace)::invariantFailed (expr=0x558116459c3d "_session->_txnActive", file=0x558116459bc4 "src/mongo/db/storage/mobile/mobile_sqlite_statement.cpp", line=109) at src/mongo/util/assert_util.cpp:104
|
#3 0x0000558113957d5c in (anonymous namespace)::invariantWithLocation<bool> (testOK=@0x7f876b33d4c8: false, expr=0x558116459c3d "_session->_txnActive", file=0x558116459bc4 "src/mongo/db/storage/mobile/mobile_sqlite_statement.cpp", line=109) at src/mongo/util/invariant.h:67
|
#4 0x00005581139b7b31 in (anonymous namespace)::SqliteStatement::step (this=0x7f875da8c980, desiredStatus=-1) at src/mongo/db/storage/mobile/mobile_sqlite_statement.cpp:109
|
#5 0x0000558113999719 in (anonymous namespace)::(anonymous namespace)::CursorBase::_advance (this=0x7f875da8ed20) at src/mongo/db/storage/mobile/mobile_index.cpp:503
|
#6 0x0000558113999c94 in (anonymous namespace)::(anonymous namespace)::CursorBase::_doSeek (this=0x7f875da8ed20) at src/mongo/db/storage/mobile/mobile_index.cpp:565
|
#7 0x00005581139992be in (anonymous namespace)::(anonymous namespace)::CursorBase::restore (this=0x7f875da8ed20) at src/mongo/db/storage/mobile/mobile_index.cpp:482
|
#8 0x0000558114464605 in (anonymous namespace)::IndexScan::doRestoreState (this=0x7f875e001020) at src/mongo/db/exec/index_scan.cpp:254
|
#9 0x0000558114482eec in (anonymous namespace)::PlanStage::restoreState (this=0x7f875e001020) at src/mongo/db/exec/plan_stage.cpp:74
|
#10 0x0000558114482eca in (anonymous namespace)::PlanStage::restoreState (this=0x7f876b6f5c20) at src/mongo/db/exec/plan_stage.cpp:71
|
#11 0x000055811450fe91 in (anonymous namespace)::PlanExecutor::restoreStateWithoutRetrying (this=0x7f876b6f6420) at src/mongo/db/query/plan_executor.cpp:373
|
#12 0x000055811451d5a7 in (anonymous namespace)::PlanYieldPolicy::yield((anonymous namespace)::function<void ()>, (anonymous namespace)::function<void ()>) (this=0x7f875de437a0, beforeYieldingFn=..., whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:165
|
#13 0x000055811451ce7e in (anonymous namespace)::PlanYieldPolicy::yieldOrInterrupt((anonymous namespace)::function<void ()>, (anonymous namespace)::function<void ()>) (this=0x7f875de437a0, beforeYieldingFn=..., whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:117
|
#14 0x000055811451c9e3 in (anonymous namespace)::PlanYieldPolicy::yieldOrInterrupt (this=0x7f875de437a0, recordFetcher=0x0) at src/mongo/db/query/plan_yield_policy.cpp:95
|
#15 0x000055811451074f in (anonymous namespace)::PlanExecutor::getNextImpl (this=0x7f876b6f6420, objOut=0x7f8764011da0, dlOut=0x0) at src/mongo/db/query/plan_executor.cpp:548
|
#16 0x0000558114510121 in (anonymous namespace)::PlanExecutor::getNext (this=0x7f876b6f6420, objOut=0x7f8764012400, dlOut=0x0) at src/mongo/db/query/plan_executor.cpp:408
|
#17 0x00005581140b6313 in (anonymous namespace)::(anonymous namespace)::MapReduceCommand::errmsgRun (this=0x558117c0b990 <mongo::mr::mapReduceCommand>, opCtx=0x7f87646213a0, dbname=..., cmd=..., errmsg=..., result=...) at src/mongo/db/commands/mr.cpp:1513
|
#18 0x000055811597edb5 in (anonymous namespace)::ErrmsgCommandDeprecated::run (this=0x558117c0b990 <mongo::mr::mapReduceCommand>, opCtx=0x7f87646213a0, db=..., cmdObj=..., result=...) at src/mongo/db/commands.cpp:546
|
#19 0x0000558115980809 in (anonymous namespace)::BasicCommand::Invocation::run (this=0x7f876b4364a0, opCtx=0x7f87646213a0, result=0x7f8764013448) at src/mongo/db/commands.cpp:449
|
#20 0x0000558113ac656f in (anonymous namespace)::(anonymous namespace)::invokeInTransaction (opCtx=0x7f87646213a0, invocation=0x7f876b4364a0, replyBuilder=0x7f8764013448) at src/mongo/db/service_entry_point_common.cpp:467
|
#21 0x0000558113ac48e2 in (anonymous namespace)::(anonymous namespace)::runCommandImpl (opCtx=0x7f87646213a0, invocation=0x7f876b4364a0, request=..., replyBuilder=0x7f876b955140, startOperationTime=..., behaviors=..., extraFieldsBuilder=0x7f87640143f0, sessionOptions=...)
|
at src/mongo/db/service_entry_point_common.cpp:542
|
|
|
After a lot of debugging I have a theory on why this is happening:
- BEGIN EXCLUSIVE takes a write lock to begin an exclusive transaction; sqlite3WalBeginWriteTransaction() is the function responsible for taking this write lock. It fails with SQLITE_BUSY_SNAPSHOT, implying that a read lock was already held and the attempt to promote it to a write lock failed because another write transaction had updated the database in the meantime.
- Some gdb digging turned up the running read query that may have taken that read lock earlier:
$3 = {db = 0x7f321942a828, pPrev = 0x7f32194fa8c8, pNext = 0x0, pParse = 0x7f321d22e578, nVar = 1, magic = 770837923, nMem = 6, nCursor = 3, cacheCtr = 3, pc = 12, rc = 0, nChange = 0, iStatement = 0, iCurrentTime = 0, nFkConstraint = 0, nStmtDefCons = 139853149675024, nStmtDefImmCons = 3, aOp = 0x7f32194ef6a8,
|
aMem = 0x7f32194efa08, apArg = 0x7f32194ef9d0, aColName = 0x7f32194fa418, pResultSet = 0x7f32194efa78, zErrMsg = 0x0, apCsr = 0x7f32194ef9b8, aVar = 0x7f32194ef9d0, pVList = 0x0, startTime = 0, nOp = 16, nResColumn = 2, errorAction = 2 '\002', minWriteFileFormat = 255 '\377', expired = 0, doingRerun = 0,
|
explain = 0, changeCntOn = 0, runOnlyOnce = 0, usesStmtJournal = 0, readOnly = 1, bIsReader = 1, isPrepareV2 = 1, btreeMask = 1, lockMask = 0, aCounter = {0, 0, 0, 0, 560}, zSql = 0x7f32194f1778 "SELECT rec_id, data from \"collection-18--5597462496437341890\" WHERE rec_id > ? ORDER BY rec_id ASC;", pFree = 0x0,
|
pFrame = 0x0, pDelFrame = 0x0, nFrame = 0, expmask = 0, pProgram = 0x0, pAuxData = 0x0}
|
- The query is: SELECT rec_id, data from \"collection-18--5597462496437341890\" WHERE rec_id > ? ORDER BY rec_id ASC;
This seems to be a cursor query iterating a collection. For some reason this query is being kept open holding a read lock on an older snapshot.
I located the query in the code and added instrumentation to follow its lifetime, and discovered the following:
- The cursor operation starts because the map-reduce code is following a query plan to iterate over all the docs, in src/mongo/db/commands/mr.cpp:
1510     // go through each doc
1511     BSONObj o;
1512     PlanExecutor::ExecState execState;
1513     while (PlanExecutor::ADVANCED == (execState = exec->getNext(&o, NULL))) {
1514         o = o.getOwned();  // we will be accessing outside of the lock
1515         // check to see if this is a new object we don't own yet
1516         // because of a chunk migration
1517         if (collMetadata) {
1518             ShardKeyPattern kp(collMetadata->getKeyPattern());
1519             if (!collMetadata->keyBelongsToMe(kp.extractShardKeyFromDoc(o))) {
1520                 continue;
1521             }
1522         }
|
- At some point the query plan yields, in PlanYieldPolicy::yield(), which calls QueryYield::yieldAllLocks():
// static
void QueryYield::yieldAllLocks(OperationContext* opCtx,
                               stdx::function<void()> whileYieldingFn,
                               const NamespaceString& planExecNS) {
    // Things have to happen here in a specific order:
    //   * Release lock mgr locks
    //   * Go to sleep
    //   * Call the whileYieldingFn
    //   * Reacquire lock mgr locks

    Locker* locker = opCtx->lockState();

    Locker::LockSnapshot snapshot;

    // Nothing was unlocked, just return, yielding is pointless.
    if (!locker->saveLockStateAndUnlock(&snapshot)) {
        return;
    }

    // Top-level locks are freed, release any potential low-level (storage engine-specific
    // locks). If we are yielding, we are at a safe place to do so.
    opCtx->recoveryUnit()->abandonSnapshot();
|
- When the cursor query started it obtained a SQLite session and a read transaction was started implicitly. When the above code calls abandonSnapshot() this read transaction is rolled back.
- Eventually the yield() operation restarts the query with a call to restoreStateWithoutRetrying():
Status PlanYieldPolicy::yield(stdx::function<void()> beforeYieldingFn,
                              stdx::function<void()> whileYieldingFn) {
    . . .
    // Can't use writeConflictRetry since we need to call saveState before reseting the transaction.
    for (int attempt = 1; true; attempt++) {
        try {
            . . .
            if (_policy == PlanExecutor::WRITE_CONFLICT_RETRY_ONLY) {
                . . .
            } else {
                // Release and reacquire locks.
                if (beforeYieldingFn)
                    beforeYieldingFn();
                QueryYield::yieldAllLocks(opCtx, whileYieldingFn, _planYielding->nss());
            }

            return _planYielding->restoreStateWithoutRetrying();
        } catch (const WriteConflictException&) {
            . . .
        }
|
- At this point the cursor query starts again, no read transaction is started this time and the query holds a read lock on the snapshot it was operating on
So effectively we now have a long-running cursor query operating outside a read transaction while holding an implicit read lock on an earlier snapshot. Another session comes along and makes a successful write, which makes the query's snapshot stale. When the session with the running query then tries to start a write transaction, it fails, because promoting a read lock on an older snapshot to a write lock is not allowed.
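The same failure mode should be reproducible with plain SQLite alone, independent of the server. Here is a sketch of that (an untested standalone illustration, WAL mode assumed, file and table names made up): one connection keeps a cursor-style SELECT open, a second connection commits a write, and the first connection's BEGIN EXCLUSIVE then fails because its read lock is on a snapshot that is now stale:

// Standalone sketch of the stale-snapshot write failure described above.
// Not server code; names are made up.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* a;
    sqlite3* b;
    sqlite3_open("busy_snapshot_demo.db", &a);
    sqlite3_exec(a,
                 "PRAGMA journal_mode=WAL;"
                 "DROP TABLE IF EXISTS t;"
                 "CREATE TABLE t(id INTEGER PRIMARY KEY);"
                 "INSERT INTO t VALUES (1);",
                 nullptr, nullptr, nullptr);
    sqlite3_open("busy_snapshot_demo.db", &b);

    // Connection A: an open cursor-style SELECT takes a read lock on the current snapshot.
    sqlite3_stmt* cursor = nullptr;
    sqlite3_prepare_v2(a, "SELECT id FROM t WHERE id > 0 ORDER BY id;", -1, &cursor, nullptr);
    sqlite3_step(cursor);

    // Connection B: a successful write makes A's snapshot stale.
    sqlite3_exec(b, "INSERT INTO t VALUES (2);", nullptr, nullptr, nullptr);

    // Connection A now tries to start a write transaction. Promoting its read lock on
    // the stale snapshot to a write lock should be refused with SQLITE_BUSY, extended
    // code SQLITE_BUSY_SNAPSHOT (the error seen from sqlite3WalBeginWriteTransaction()).
    int rc = sqlite3_exec(a, "BEGIN EXCLUSIVE;", nullptr, nullptr, nullptr);
    std::printf("BEGIN EXCLUSIVE on A: rc=%d, extended=%d, msg=%s\n",
                rc, sqlite3_extended_errcode(a), sqlite3_errmsg(a));

    // Finalizing the statement is what releases the stale snapshot.
    sqlite3_finalize(cursor);
    sqlite3_close(a);
    sqlite3_close(b);
    return 0;
}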
|
|
I would add logging to the top of the SQLite layer in our code to examine all interaction with SQLite: log the beginning and end of every SQL statement sent to SQLite. Hopefully that will show some other operation still in progress when we are in finalReduce().
My guess is that some internal SQL operation done by the mobile storage engine code itself is the culprit, which would escape the lock checking you put in.
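For instance, a thin wrapper of roughly this shape at the statement layer would do. This is a hypothetical sketch, not existing code; the real hook point would be wherever the mobile engine prepares and finalizes its statements, and the logging mechanism would be whatever that branch uses:

// Hypothetical instrumentation sketch: log the begin/end of every SQL statement sent
// to SQLite, so that a statement still open when finalReduce() runs stands out.
#include <sqlite3.h>
#include <iostream>
#include <string>

class LoggedStatement {
public:
    LoggedStatement(sqlite3* db, const std::string& sql) : _sql(sql) {
        std::cerr << "[sqlite] begin stmt: " << _sql << std::endl;
        sqlite3_prepare_v2(db, _sql.c_str(), -1, &_stmt, nullptr);
    }

    int step() {
        return sqlite3_step(_stmt);
    }

    ~LoggedStatement() {
        std::cerr << "[sqlite] end stmt:   " << _sql << std::endl;
        sqlite3_finalize(_stmt);
    }

private:
    std::string _sql;
    sqlite3_stmt* _stmt = nullptr;
};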
|
|
The operation that gets stuck in an infinite write-conflict loop is State::insert in src/mongo/db/commands/mr.cpp.
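For context, the retry helper seen in frame #4 of the trace below has roughly this shape (a simplified paraphrase of writeConflictRetry from write_conflict_exception.h, not a verbatim copy): every WriteConflictException abandons the snapshot, backs off, and retries, so a conflict that can never clear turns into an infinite loop:

// Simplified paraphrase of the writeConflictRetry helper, not a verbatim copy: it
// abandons the snapshot, logs/backs off, and retries forever on WriteConflictException.
#include "mongo/base/string_data.h"
#include "mongo/db/concurrency/write_conflict_exception.h"
#include "mongo/db/operation_context.h"
#include "mongo/db/storage/recovery_unit.h"

namespace mongo {

template <typename F>
auto writeConflictRetrySketch(OperationContext* opCtx, StringData opStr, StringData ns, F&& f) {
    int attempts = 0;
    while (true) {
        try {
            return f();  // e.g. the collection insert performed by State::insert
        } catch (const WriteConflictException&) {
            ++attempts;
            WriteConflictException::logAndBackoff(attempts, opStr, ns);
            opCtx->recoveryUnit()->abandonSnapshot();  // give up the snapshot and try again
        }
    }
}

}  // namespace mongo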
Here is a part of the stack trace:
Thread 16 (Thread 0x7f4f0f7a3700 (LWP 20188)):
|
#0 0x00007f4f1b5ad5f8 in __GI___nanosleep (requested_time=0x7f4f0f79f5b0, remaining=0x7f4f0f79f5a0) at ../sysdeps/unix/sysv/linux/nanosleep.c:27
|
#1 0x000055a4e3931033 in (anonymous namespace)::(anonymous namespace)::(anonymous namespace)::sleep_for<long, std::ratio<1, 1000000000> > (sleep_duration=...) at src/mongo/stdx/thread.h:130
|
#2 (anonymous namespace)::sleepmillis (s=<optimised out>) at src/mongo/util/time_support.cpp:728
|
#3 0x000055a4e31b9445 in (anonymous namespace)::WriteConflictException::logAndBackoff (attempt=3281, operation=..., ns=...) at src/mongo/db/concurrency/write_conflict_exception.cpp:61
|
#4 0x000055a4e25b883e in (anonymous namespace)::writeConflictRetry<(lambda at src/mongo/db/commands/mr.cpp:786:56)> (opCtx=<optimised out>, opStr=..., ns=..., f=...) at src/mongo/db/concurrency/write_conflict_exception.h:94
|
#5 (anonymous namespace)::(anonymous namespace)::State::insert (this=<optimised out>, nss=..., o=...) at src/mongo/db/commands/mr.cpp:786
|
#6 0x000055a4e25b9ed7 in (anonymous namespace)::(anonymous namespace)::State::finalReduce (this=0x7f4f0f7a0220, values=...) at src/mongo/db/commands/mr.cpp:1049
|
#7 0x000055a4e25baf2f in (anonymous namespace)::(anonymous namespace)::State::finalReduce (this=<optimised out>, opCtx=<optimised out>, curOp=<optimised out>, pm=...) at src/mongo/db/commands/mr.cpp:1214
|
#8 0x000055a4e25c200c in (anonymous namespace)::(anonymous namespace)::MapReduceCommand::errmsgRun (this=<optimised out>, opCtx=0x7f4f0f79f5a0, dbname=..., cmd=..., errmsg=..., result=...) at src/mongo/db/commands/mr.cpp:1610
|
#9 0x000055a4e33fa9d4 in (anonymous namespace)::ErrmsgCommandDeprecated::run (this=0x7f4f0f79f5b0, opCtx=0x7f4f0f79f5a0, db=..., cmdObj=..., result=...) at src/mongo/db/commands.cpp:546
|
#10 0x000055a4e33fb3ac in (anonymous namespace)::BasicCommand::Invocation::run (this=0x55a4e66eee40, opCtx=0x55a4e6dd3000, result=<optimised out>) at src/mongo/db/commands.cpp:449
|
#11 0x000055a4e21ff32b in (anonymous namespace)::(anonymous namespace)::invokeInTransaction (opCtx=0x55a4e6dd3000, invocation=0x55a4e66eee40, replyBuilder=0x7f4f0f7a09a0) at src/mongo/db/service_entry_point_common.cpp:464
|
#12 0x000055a4e21faf70 in (anonymous namespace)::(anonymous namespace)::runCommandImpl (opCtx=0x55a4e6dd3000, invocation=0x55a4e66eee40, request=..., replyBuilder=0x55a4e6462dd0, behaviors=..., sessionOptions=..., startOperationTime=...,
|
extraFieldsBuilder=<optimised out>) at src/mongo/db/service_entry_point_common.cpp:539
|
#13 (anonymous namespace)::(anonymous namespace)::execCommandDatabase (opCtx=<optimised out>, command=0x55a4e495bd40 <mongo::mr::mapReduceCommand>, request=..., replyBuilder=<optimised out>, behaviors=...)
|
at src/mongo/db/service_entry_point_common.cpp:875
|
#14 0x000055a4e21f34c8 in operator() (this=<optimised out>) at src/mongo/db/service_entry_point_common.cpp:1022
|
#15 (anonymous namespace)::(anonymous namespace)::runCommands (opCtx=<optimised out>, message=..., behaviors=...) at src/mongo/db/service_entry_point_common.cpp:970
|
See the attached file for full backtrace and for all active threads.
While doing a "BEGIN EXCLUSIVE" SQLite keeps throwing a BUSY error. Some gdb debugging inside the SQLite code gave me more pointers to what is going on:
I don't understand why will there be a read active at this stage. The asserts that I have put that write should have taken a global write lock and reads should have taken global read locks didn't trigger.
milkie can you suggest how to proceed.
|
|
Map-reduce currently gets into an infinite write-conflict loop while trying to start an exclusive transaction. To reproduce, I did the following:
Limit the concurrency test suite to a single test, map_reduce_replace.js:
diff --git a/buildscripts/resmokeconfig/suites/concurrency.yml b/buildscripts/resmokeconfig/suites/concurrency.yml
index 80f2f12595..48dbab3c67 100644
--- a/buildscripts/resmokeconfig/suites/concurrency.yml
+++ b/buildscripts/resmokeconfig/suites/concurrency.yml
@@ -2,7 +2,7 @@ test_kind: fsm_workload_test
 
 selector:
   roots:
-  - jstests/concurrency/fsm_workloads/**/*.js
+  - jstests/concurrency/fsm_workloads/**/map_reduce_replace.js
   exclude_with_any_tags:
   - uses_transactions
   - requires_replication
|
Now I am running the test suite as follows:
buildscripts/resmoke.py --storageEngine=mobile --dbpathPrefix=/tmp/delme --suites=concurrency
|
|
|
Status:
We have to consider whether we need map-reduce on mobile, and weigh the cost of finding and fixing this bug against simply disabling map-reduce for mobile.
|