- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- ALL
- 2
- 🟥 DDL
The DropSessionsCollection hook can time out with "operation cancelled" errors when it runs in suites that also execute tests calling checkMetadataConsistency, because of DDL lock contention.
The suites currently affected are: sharded_multi_stmt_txn_jscore_passthrough, multi_shard_multi_stmt_txn_stepdown_primary_jscore_passthrough, bulk_write_sharded_multi_stmt_txn_jscore_passthrough (and their variants).
Test triggering the issue: check_metadata_consistency_basic.js
Root cause
- DropSessionsCollection._sc_block_sessions_refresh() sets the failpoint preventSessionsCollectionSharding on all CSRS nodes
- _sc_drop_collection() then attempts to drop config.system.sessions and call _flushRoutingTableCacheUpdatesWithWriteConcern
- Concurrently, checkMetadataConsistency acquires DDL locks on shards via _shardsvrCheckMetadataConsistency
- The drop operations conflict with the DDL locks held by checkMetadataConsistency (CMC) and fail with transient errors
- with_naive_retry retries for 100 seconds until timeout
- The failpoint is never turned off because _sc_unblock_sessions_refresh() only runs after successful drop
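The sequence above can be modelled in a few lines of Python. This is a simplified sketch and not the actual hook code: `FakeCluster`, `naive_retry`, `TransientError`, and the shortened timeout are hypothetical stand-ins for the real hook methods, with_naive_retry, and the 100-second retry window.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for the transient errors caused by DDL lock contention."""


class FakeCluster:
    """Hypothetical model of the state the hook touches (not the real fixture)."""

    def __init__(self):
        self.failpoint_on = False

    def block_sessions_refresh(self):
        # Models enabling preventSessionsCollectionSharding.
        self.failpoint_on = True

    def unblock_sessions_refresh(self):
        # Models disabling the failpoint.
        self.failpoint_on = False

    def drop_collection(self):
        # Model: a concurrent consistency check always holds the DDL lock.
        raise TransientError("operation cancelled")


def naive_retry(func, timeout_secs, sleep_secs=0.01):
    """Retry func on TransientError until the deadline passes, then re-raise."""
    deadline = time.monotonic() + timeout_secs
    while True:
        try:
            return func()
        except TransientError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(sleep_secs)


def run_hook_without_cleanup(cluster, timeout_secs=0.05):
    cluster.block_sessions_refresh()
    naive_retry(cluster.drop_collection, timeout_secs)
    # Only reached when the drop succeeds, so a timeout skips this call.
    cluster.unblock_sessions_refresh()


cluster = FakeCluster()
try:
    run_hook_without_cleanup(cluster)
except TransientError:
    pass

# True when the failpoint stayed enabled after the failed drop.
leaked = cluster.failpoint_on
```

In this model the retry loop gives up once the deadline passes, the exception propagates, and the unblock step is skipped.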
One possible fix is to ensure the failpoint is turned off in a finally block, even if the drop fails.
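A minimal sketch of that idea, assuming the hook is structured as block, drop, unblock. `FakeHook`, `DropFailed`, and `run_and_report` are hypothetical names used only for illustration; the real hook methods and their signatures may differ.

```python
class DropFailed(Exception):
    """Hypothetical error raised when the drop does not succeed in time."""


class FakeHook:
    """Hypothetical model of the hook; only tracks whether the failpoint is on."""

    def __init__(self, drop_succeeds):
        self.drop_succeeds = drop_succeeds
        self.failpoint_on = False

    def block_sessions_refresh(self):
        self.failpoint_on = True

    def unblock_sessions_refresh(self):
        self.failpoint_on = False

    def drop_collection(self):
        if not self.drop_succeeds:
            raise DropFailed("operation cancelled")

    def run(self):
        self.block_sessions_refresh()
        try:
            self.drop_collection()
        finally:
            # Runs on success and on failure, so the failpoint cannot leak.
            self.unblock_sessions_refresh()


def run_and_report(hook):
    """Return (drop_raised, failpoint_still_on) after running the hook."""
    raised = False
    try:
        hook.run()
    except DropFailed:
        raised = True
    return raised, hook.failpoint_on
```

With this shape the original error still propagates to the caller, so a failed drop is still reported, but the failpoint is cleared either way.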