[SERVER-68661] Deadlock with transactions after step down Created: 09/Aug/22  Updated: 27/Oct/23  Resolved: 05/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Antonio Fuschetto Assignee: Kaloian Manassiev
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File deadlock.patch    
Issue Links:
Related
related to SERVER-69108 SCCL can immediately return config an... Closed
is related to SERVER-66972 Database critical section does not se... Closed
Operating System: ALL
Steps To Reproduce:

This happens in approximately 25-50% of the retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart.js test runs when using the attached patch (deadlock.patch).

Participants:

 Description   

After SERVER-66972 (which delegates the refresh of the database version to another thread), the JS test retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart.js started triggering a deadlock with prepared transactions after a node step-down. The function that refreshes the local database metadata tries to take an X lock on the database, but it waits indefinitely because a prepared transaction acquired, and still holds, an IX lock on the same database.

Unfortunately, this is blocking the fix for SERVER-66972 (Database critical section does not serialize with ongoing refreshes).

In detail, the JS test runs the command below, which never completes:

{  "abortTransaction" : 1,  "lsid" : {  "id" : UUID("372cc71c-e185-409d-ad2e-465e053764a4"),  "txnNumber" : NumberLong(35),  "txnUUID" : UUID("e8666f6d-e097-4ccd-ad30-608b7494aebc") },  "txnNumber" : NumberLong(0),  "autocommit" : false }

From the dump of the lock manager, I see that the RefreshDbVersionThread is waiting to acquire the X lock on the database, while the resource is already IX-locked by another thread working on the transaction e8666f6d-e097-4ccd-ad30-608b7494aebc:

{
   "lockAddr":"0x7f21e5c89b20",
   "resourceId":"{6237343057549539649: Database, 1625657039122151745, config}",
   "granted":[
      {
         "lockRequest":"0x7aa",
         "lockRequestAddr":"0x7f21e5cb6020",
         "thread":"thread::id of a non-executing thread",
         "mode":"IX",
         "convertMode":"NONE",
         "enqueueAtFront":false,
         "compatibleFirst":false,
         "debugInfo":"lsid: { id: UUID(\"372cc71c-e185-409d-ad2e-465e053764a4\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855), txnNumber: 35, txnUUID: UUID(\"e8666f6d-e097-4ccd-ad30-608b7494aebc\") }"
      }
   ],
   "pending":[
      {
         "lockRequest":"0x911",
         "lockRequestAddr":"0x7f21ec477820",
         "thread":"139783425423104",
         "mode":"X",
         "convertMode":"NONE",
         "enqueueAtFront":false,
         "compatibleFirst":false,
         "debugInfo":"",
         "clientInfo":{
            "desc":"RefreshDbVersionThread",
            "opid":2311
         }
      },
      ...
   ]
}
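
For reference, a comparable view can be obtained from a live node with the lockInfo command and the currentOp helper; the snippet below is only a convenience sketch:

// Dump the lock manager state: the granted MODE_IX request of the prepared
// transaction and the pending MODE_X request of the RefreshDbVersionThread
// show up here, as in the excerpt above.
db.adminCommand({lockInfo: 1});

// List operations currently blocked on a lock; the refresh thread appears
// here with waitingForLock: true.
db.currentOp({waitingForLock: true});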

Additional details:

This problem looks similar to (but is not the same as) SERVER-62951.



 Comments   
Comment by Kaloian Manassiev [ 05/Sep/22 ]

With the changes made as part of SERVER-69108, this is no longer an issue. We implemented the resolution suggested by max.hirschhorn@mongodb.com in the previous comment.

Comment by Max Hirschhorn [ 16/Aug/22 ]

antonio.fuschetto@mongodb.com, yes, we can have someone from Sharding NYC take a look. From what I understand, the {_flushDatabaseCacheUpdates: "config"} command that Sharding EMEA has introduced attempts to take a MODE_X lock on the config database. The liveness of the server around transactions depends on the config database X lock never being taken. When the retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart.js test was introduced with the changes in 7eb9325 as part of SERVER-63071, those changes also had to fix how the config.system.indexBuilds collection was created, so that it acquires only the config.system.indexBuilds collection X lock instead of the config database X lock.

I suspect the resolution here will be that {_flushDatabaseCacheUpdates: "config"} must be special-cased in some way.

Comment by Max Hirschhorn [ 09/Aug/22 ]

I had mentioned to Antonio over Slack that the reason the prepared transaction holds the config database lock in MODE_IX is that the postImage document of the findAndModify command is stored in the config.image_collection collection.
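
As a rough illustration (collection and field names below are made up, and the test actually drives these writes through retryable internal transactions, note the txnUUID inside the lsid above), the operation involved is a retryable findAndModify that requests its post-image; with the image collection enabled, that post-image is persisted in config.image_collection within the same transaction, which is what keeps the MODE_IX lock on the config database until the transaction commits or aborts:

// Illustrative sketch only: a retryable findAndModify whose post-image must
// be recorded for retryability. With the side image collection enabled, the
// post-image document is written to config.image_collection as part of the
// same transaction, hence the config database lock held in MODE_IX.
const session = db.getMongo().startSession({retryWrites: true});
const coll = session.getDatabase("test").getCollection("user");

coll.findAndModify({
    query: {_id: 0},
    update: {$inc: {counter: 1}},
    new: true   // ask for the post-image
});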

It looks like _flushDatabaseCacheUpdates is being triggered by the TTL monitor for the config.system.sessions collection. In my mind this is unnecessary because (1) the config.system.sessions collection is always sharded, and (2) the primary shard for the config database is the config server, so the database version is meaningless to the shards.
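
For reference, the refresh in question corresponds to the internal command captured in the log excerpt below; issuing it by hand against the admin database of the shard primary shows the same behavior (the command body and maxTimeMS value are taken from the log line):

// Internal command sent by the shard-version recovery triggered from the TTL
// monitor, as seen in the logs below. While the prepared transaction holds the
// config database IX lock, this request for the MODE_X database lock blocks
// and eventually fails with MaxTimeMSExpired.
db.getSiblingDB("admin").runCommand({
    _flushDatabaseCacheUpdates: "config",
    maxTimeMS: 30000
});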

[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] Test when an old secondary steps up
...
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] Testing with options {
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] 	"runFindAndModifyWithPreOrPostImage" : true,
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] 	"abortTxnAfterFailover" : true,
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] 	"enableFindAndModifyImageCollection" : true
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] }
...
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.683+00:00 I  NETWORK  4712102 [ShardServerCatalogCacheLoader::getDatabase] "Host failed in replica set","attr":{"replicaSet":"retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart-rs0","host":"ip-10-128-170-164.ec2.internal:20045","error":{"code":202,"codeName":"NetworkInterfaceExceededTimeLimit","errmsg":"Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"},"action":{"dropConnections":false,"requestImmediateCheck":false,"outcome":{"host":"ip-10-128-170-164.ec2.internal:20045","success":false,"errorMessage":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}}}
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.683+00:00 I  SHARDING 22739   [ShardServerCatalogCacheLoader::getDatabase] "Operation timed out","attr":{"error":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.685+00:00 I  SHARDING 22062   [TTLShardVersionRecovery] "Failed to refresh metadata for collection","attr":{"namespace":"config.system.sessions","error":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}
...
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.716+00:00 I  SHARDING 22065   [conn61] "Failed to refresh databaseVersion","attr":{"db":"config","error":"MaxTimeMSExpired: operation exceeded time limit"}
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.717+00:00 W  COMMAND  20525   [conn61] "Failed to gather storage statistics for slow operation","attr":{"opId":2233,"error":"lock acquire timeout"}
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.717+00:00 I  COMMAND  51803   [conn61] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"/data/mci/708ab6bfed5f2e904c1261140d186eb9/src/dist-test/bin/mongod","command":{"_flushDatabaseCacheUpdates":"config","maxTimeMS":30000,"maxTimeMSOpOnly":30000,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1659993952,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1659993938,"i":1}},"$topologyTime":{"$timestamp":{"t":1659993907,"i":3}},"mayBypassWriteBlocking":false,"$db":"admin"},"databaseVersionRefreshDurationMillis":30029,"numYields":0,"ok":0,"errMsg":"operation exceeded time limit","errName":"MaxTimeMSExpired","errCode":50,"reslen":290,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"FeatureCompatibilityVersion":{"acquireCount":{"r":2}},"ReplicationStateTransition":{"acquireCount":{"w":2}},"Global":{"acquireCount":{"r":2}},"Database":{"acquireCount":{"r":2}},"Mutex":{"acquireCount":{"r":1,"W":1}}},"remote":"10.128.170.164:36854","protocol":"op_msg","durationMillis":30030}

https://evergreen.mongodb.com/lobster/build/3cedf42d3184984b2a8e1a3d44d5e84d/test/62f17ee9f966b7eb700ec159#bookmarks=0%2C6909%2C15203%2C15204%2C15205%2C15206%2C18259%2C18264%2C108341&f~=110~%28REPL_HB%7CELECTION%29&f~=100~d20045%5C%7C&l=1

I wonder if the Checkpointer thread repeatedly timing out on the config database MODE_IS lock acquisition while calling TransactionParticipant::getOldestActiveTimestamp() is preventing the majority-commit point from advancing, which would explain why the test client hasn't run the abortTransaction command.

[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:54.825+00:00 D2 RECOVERY 23986   [Checkpointer] "Performing stable checkpoint.","attr":{"stableTimestamp":{"$timestamp":{"t":1659993952,"i":2}},"oplogNeededForRollback":"LockTimeout: Unable to acquire IS lock on '{6237343057549539649: Database, 1625657039122151745, config}' within 100ms. opId: 3369, op: OldestActiveTxnTimestamp, connId: 0."}
