|
I had mentioned to Antonio over Slack that the reason for the prepared transaction holding the config database lock in MODE_IX is due to the postImage document of the findAndModify command being stored in the config.image_collection collection.
It looks like the _flushDatabaseCacheUpdates is being triggered by the TTL monitor for the config.system.sessions collection. In my mind this is unnecessary because (1) the config.system.sessions collection is always sharded and (2) the primary shard for the config database is the config server and so the database version is meaningless for the shards.
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] Test when an old secondary steps up
|
...
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] Testing with options {
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] "runFindAndModifyWithPreOrPostImage" : true,
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] "abortTxnAfterFailover" : true,
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] "enableFindAndModifyImageCollection" : true
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] [jsTest] }
|
...
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.683+00:00 I NETWORK 4712102 [ShardServerCatalogCacheLoader::getDatabase] "Host failed in replica set","attr":{"replicaSet":"retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart-rs0","host":"ip-10-128-170-164.ec2.internal:20045","error":{"code":202,"codeName":"NetworkInterfaceExceededTimeLimit","errmsg":"Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"},"action":{"dropConnections":false,"requestImmediateCheck":false,"outcome":{"host":"ip-10-128-170-164.ec2.internal:20045","success":false,"errorMessage":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}}}
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.683+00:00 I SHARDING 22739 [ShardServerCatalogCacheLoader::getDatabase] "Operation timed out","attr":{"error":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20046| 2022-08-08T21:26:22.685+00:00 I SHARDING 22062 [TTLShardVersionRecovery] "Failed to refresh metadata for collection","attr":{"namespace":"config.system.sessions","error":"NetworkInterfaceExceededTimeLimit: Request 263 timed out, deadline was 2022-08-08T21:26:22.682+00:00, op was RemoteCommand 263 -- target:[ip-10-128-170-164.ec2.internal:20045] db:admin expDate:2022-08-08T21:26:22.682+00:00 cmd:{ _flushDatabaseCacheUpdates: \"config\", maxTimeMS: 30000 }"}
|
...
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.716+00:00 I SHARDING 22065 [conn61] "Failed to refresh databaseVersion","attr":{"db":"config","error":"MaxTimeMSExpired: operation exceeded time limit"}
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.717+00:00 W COMMAND 20525 [conn61] "Failed to gather storage statistics for slow operation","attr":{"opId":2233,"error":"lock acquire timeout"}
|
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:22.717+00:00 I COMMAND 51803 [conn61] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"/data/mci/708ab6bfed5f2e904c1261140d186eb9/src/dist-test/bin/mongod","command":{"_flushDatabaseCacheUpdates":"config","maxTimeMS":30000,"maxTimeMSOpOnly":30000,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1659993952,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1659993938,"i":1}},"$topologyTime":{"$timestamp":{"t":1659993907,"i":3}},"mayBypassWriteBlocking":false,"$db":"admin"},"databaseVersionRefreshDurationMillis":30029,"numYields":0,"ok":0,"errMsg":"operation exceeded time limit","errName":"MaxTimeMSExpired","errCode":50,"reslen":290,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"FeatureCompatibilityVersion":{"acquireCount":{"r":2}},"ReplicationStateTransition":{"acquireCount":{"w":2}},"Global":{"acquireCount":{"r":2}},"Database":{"acquireCount":{"r":2}},"Mutex":{"acquireCount":{"r":1,"W":1}}},"remote":"10.128.170.164:36854","protocol":"op_msg","durationMillis":30030}
|
https://evergreen.mongodb.com/lobster/build/3cedf42d3184984b2a8e1a3d44d5e84d/test/62f17ee9f966b7eb700ec159#bookmarks=0%2C6909%2C15203%2C15204%2C15205%2C15206%2C18259%2C18264%2C108341&f~=110~%28REPL_HB%7CELECTION%29&f~=100~d20045%5C%7C&l=1
I wonder if the Checkpointer thread repeatedly timing out on the config database MODE_IS lock acquisition while calling TransactionParticipant::getOldestActiveTimestamp() is preventing the majority-commit point from advancing and explains why the test client hasn't run the abortTransaction command.
[js_test:retryable_findAndModify_commit_and_abort_prepared_txns_after_failover_and_restart] d20045| 2022-08-08T21:26:54.825+00:00 D2 RECOVERY 23986 [Checkpointer] "Performing stable checkpoint.","attr":{"stableTimestamp":{"$timestamp":{"t":1659993952,"i":2}},"oplogNeededForRollback":"LockTimeout: Unable to acquire IS lock on '{6237343057549539649: Database, 1625657039122151745, config}' within 100ms. opId: 3369, op: OldestActiveTxnTimestamp, connId: 0."}
|
|