[SERVER-66752] Determine if lock acquisition can happen inside of the writeConflictRetry in find_and_modify.cpp Created: 25/May/22  Updated: 27/Oct/23  Resolved: 02/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Gregory Wlodarek Assignee: Jordi Olivares Provencio
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-65418 Release all resources before sleep in... Open
Sprint: Execution Team 2022-08-22, Execution Team 2022-09-05
Participants:

 Description   

Running

resmoke --suites=concurrency jstests/concurrency/fsm_workloads/CRUD_and_commands_with_createindexes.js

with this diff

diff --git a/src/mongo/db/concurrency/exception_util.h b/src/mongo/db/concurrency/exception_util.h
index 6cfe238ca7d..63243009bac 100644
--- a/src/mongo/db/concurrency/exception_util.h
+++ b/src/mongo/db/concurrency/exception_util.h
@@ -34,6 +34,7 @@
 #include "mongo/db/operation_context.h"
 #include "mongo/util/assert_util.h"
 #include "mongo/util/fail_point.h"
+#include "mongo/util/stacktrace.h"
 
 namespace mongo {
 
@@ -124,6 +125,10 @@ auto writeConflictRetry(OperationContext* opCtx, StringData opStr, StringData ns
     int attemptsTempUnavailable = 0;
     while (true) {
         try {
+            if (opCtx->lockState()->isLocked()) {
+                printStackTrace();
+            }
+
             return f();
         } catch (WriteConflictException const&) {
             CurOp::get(opCtx)->debug().additiveMetrics.incrementWriteConflicts(1);

Shows that this writeConflictRetry is run with locks acquired at some earlier point. This is problematic because if a WriteConflictException is thrown, the locks and ticket will be held in logWriteConflictAndBackoff, which sleeps.

 



 Comments   
Comment by Jordi Olivares Provencio [ 02/Sep/22 ]

Gotcha, I'll change it to Works as Designed then. Thanks for the correction gregory.noma@mongodb.com!

Comment by Gregory Noma [ 02/Sep/22 ]

Since the intent of this ticket is to ensure that we're not calling logWriteConflictAndBackoff while holding resources from this particular writeConflictRetry in find_and_modify.cpp, if we've determined that isn't the case then can we instead close this ticket as "Works as Designed"? Sorry to be pedantic, I just want to make sure it's clear whether the outcome here is "we're sleeping while holding resources but we can't really do anything about it here" or "we're not sleeping while holding resources here". And I think "Wont Fix" implies the former whereas "Works as Designed" implies the latter.

Comment by Jordi Olivares Provencio [ 02/Sep/22 ]

I would imagine, the unlockPending counter only gets increased from 0 when restoring a WUOW or if we're in one while unlocking

Comment by Gregory Noma [ 02/Sep/22 ]

Thanks jordi.olivares-provencio@mongodb.com, does that mean that in the cases you observed we won't end up calling logWriteConflictAndBackoff because there's a higher-level WUOW (and presumably a higher-level writeConflictRetry), turning this writeConflictRetry into just a simple invocation of the function?

Comment by Jordi Olivares Provencio [ 02/Sep/22 ]

gregory.noma@mongodb.com Forgot to mention it doesn't result in failures with the reproducer. Updated the comment to reflect that.

Comment by Gregory Noma [ 02/Sep/22 ]

Adding invariant(!opCtx->lockState()->isLocked()) at the following location doesn't yield any invariant failures in master.

Running a test by just invariant-ing if we're locked yielded a large amount of errors in our tests.

jordi.olivares-provencio@mongodb.com I'm a little confused it seems like these two sentences are contradictory? Or is "invariant-ing if we're locked" referring to something else?

Comment by Jordi Olivares Provencio [ 02/Sep/22 ]

Adding invariant(!opCtx->lockState()->isLocked()) at the following location doesn't yield any invariant failures in master with the given reproducer.

Running a test by just invariant-ing if we're locked yielded a large amount of errors in our tests. Investigating jstests/replsets/tenant_migration_fetch_committed_transactions.js and dumping the locks at crash time yielded the following:

[js_test:tenant_migration_fetch_committed_transactions] d20040| 2022-09-02T11:22:17.377+00:00 I  -        20523   [conn1] "Locker status","attr":{"id":561,"requests":[{"key":"{2305843009213693953: Global, 1}","status":"granted","recursiveCount":1,"unlockPending":1,"mode":"IX"},{"key":"{2305843009213693954: Global, 2}","status":"granted","recursiveCount":1,"unlockPending":1,"mode":"IX"},{"key":"{2305843009213693955: Global, 3}","status":"granted","recursiveCount":1,"unlockPending":1,"mode":"IX"}]}

This means that they are using two phase locking and some locks will still be held at the time of calling find_and_modify. This is fundamental to how MongoDB operates. Barring some seriously deep architectural refactors I don't believe we can do anything here.

Closing the ticket as a result.

Generated at Thu Feb 08 06:06:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.