[Test Only] DropSessionsCollection hook is not compatible with checkMetadataConsistency


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • 2
    • 🟥 DDL

      The DropSessionsCollection hook can time out with "operation cancelled" errors when it runs in suites that also execute tests calling checkMetadataConsistency, because of DDL lock contention.

      The suites currently affected are sharded_multi_stmt_txn_jscore_passthrough, multi_shard_multi_stmt_txn_stepdown_primary_jscore_passthrough, and bulk_write_sharded_multi_stmt_txn_jscore_passthrough (and their variants).
      Test triggering the issue: check_metadata_consistency_basic.js

      Root cause

      1. DropSessionsCollection._sc_block_sessions_refresh() sets the failpoint preventSessionsCollectionSharding on all CSRS nodes
      2. _sc_drop_collection() then attempts to drop config.system.sessions and call _flushRoutingTableCacheUpdatesWithWriteConcern
      3. Concurrently, checkMetadataConsistency acquires DDL locks on shards via _shardsvrCheckMetadataConsistency
      4. The drop operations conflict with the DDL locks held by checkMetadataConsistency and fail with transient errors; with_naive_retry keeps retrying for 100 seconds until it times out
      5. The failpoint is never turned off, because _sc_unblock_sessions_refresh() only runs after a successful drop

      We could, for example, ensure the failpoint is turned off in a finally block even if the drop fails.
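
      A minimal sketch of that idea, not the actual hook code: FakeHook is a hypothetical stand-in whose methods only borrow the names from the root cause above, and the failpoint_on flag and drop_should_fail parameter are assumptions made so the example runs without a cluster. It only shows that wrapping the drop in try/finally clears the failpoint even when the drop raises.

```python
class FakeHook:
    """Hypothetical stand-in for the DropSessionsCollection hook.

    The real methods talk to the cluster; here they only flip a flag so the
    control flow can be exercised locally.
    """

    def __init__(self, drop_should_fail):
        self.drop_should_fail = drop_should_fail  # assumption: simulates the timeout
        self.failpoint_on = False                 # assumption: models the failpoint state

    def _sc_block_sessions_refresh(self):
        # Real hook: enable preventSessionsCollectionSharding on all CSRS nodes.
        self.failpoint_on = True

    def _sc_drop_collection(self):
        # Real hook: drop config.system.sessions, retrying on transient errors.
        if self.drop_should_fail:
            raise TimeoutError("operation cancelled")

    def _sc_unblock_sessions_refresh(self):
        # Real hook: turn the failpoint off again.
        self.failpoint_on = False

    def run(self):
        self._sc_block_sessions_refresh()
        try:
            self._sc_drop_collection()
        finally:
            # Runs on success and on failure, so the failpoint cannot leak.
            self._sc_unblock_sessions_refresh()


if __name__ == "__main__":
    hook = FakeHook(drop_should_fail=True)
    try:
        hook.run()
    except TimeoutError as exc:
        print(f"drop failed: {exc}; failpoint still on: {hook.failpoint_on}")
```

      With this shape the drop error still propagates to the caller, so the timeout stays visible in the test output; only the cleanup is made unconditional.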

            Assignee:
            Unassigned
            Reporter:
            Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            1
