WiredTiger Eviction Stalling In Disagg

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines, Storage Engines - Transactions

      Summary

      WiredTiger normally evicts pages when there is cache pressure. In DSC, however, WiredTiger appears unable to evict pages for a transaction that is causing cache pressure. Application threads then stall because eviction makes no progress or is very slow.

      This issue was noticed when running transaction_too_large_for_cache.js under the no_passthrough_disagg_override.yml suite, where the test hangs for 30 minutes (see patch run). The test:

      • Repeatedly inserts large documents in a transaction to fill up the cache (here)
      • Expects WiredTiger to eventually return a rollback error due to cache pressure, which mongod then propagates to the user (here)
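
      For reference, the behaviour the test relies on can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyCache, RollbackError, and fill_until_rollback are hypothetical names, not WiredTiger or mongod APIs, and the model simply assumes that bytes written by the open transaction cannot be evicted until the transaction ends.

```python
class RollbackError(Exception):
    """Stand-in for the rollback error the test expects (hypothetical name)."""


class ToyCache:
    """Toy cache: NOT WiredTiger's implementation, only an illustration."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        # Bytes written by the open transaction. Assumption of this model:
        # they stay pinned in the cache until the transaction ends.
        self.pinned_bytes = 0

    def insert_in_txn(self, doc_bytes):
        # If the insert does not fit, eviction cannot help in this model,
        # so the transaction is rolled back instead of waiting.
        if self.pinned_bytes + doc_bytes > self.capacity_bytes:
            self.pinned_bytes = 0  # rollback releases the pinned bytes
            raise RollbackError("transaction too large for cache")
        self.pinned_bytes += doc_bytes


def fill_until_rollback(cache, doc_bytes, max_inserts):
    """Insert repeatedly; return how many inserts succeeded before rollback.

    Returns None if no rollback happened within max_inserts attempts.
    """
    for succeeded in range(max_inserts):
        try:
            cache.insert_in_txn(doc_bytes)
        except RollbackError:
            return succeeded
    return None
```

      In this model the loop always ends with an error once the transaction outgrows the cache. In the disagg run, the test never observes that error and the operation waits instead.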

      However, the stack traces suggest that the thread running the transaction is stuck waiting for eviction:

      [2026/02/25 16:28:31.086] Thread 7 (Thread 0xffff76a430c0 (LWP 167429)):
      [2026/02/25 16:28:31.086] #0  0x0000ffffa7122794 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
      [2026/02/25 16:28:31.086] #1  0x0000ffffa71252c8 [PAC] in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libc.so.6
      [2026/02/25 16:28:31.086] #2  0x0000aaaacf7ce4c0 [PAC] in __wt_cond_wait_signal (session=session@entry=0x31f5fb808cb8, cond=0x31f5ff409e60, usecs=usecs@entry=10000, run_func=<optimized out>, run_func@entry=0xffff76a3efc0, signalled=signalled@entry=0xffff76a3f000) at ./src/third_party/wiredtiger/src/os_posix/os_mtx_cond.c:115
      [2026/02/25 16:28:31.086] #3  0x0000aaaacf95bf18 in __wt_cond_wait (session=0x31f5fb808cb8, cond=0x89, cond@entry=0x0, usecs=10000, run_func=0x0) at src/third_party/wiredtiger/src/include/misc_inline.h:21
      [2026/02/25 16:28:31.086] #4  __wti_evict_app_assist_worker (session=session@entry=0x31f5fb808cb8, busy=<optimized out>, readonly=false, interruptible=true) at ./src/third_party/wiredtiger/src/evict/evict_lru.c:3226
      [2026/02/25 16:28:31.086] #5  0x0000aaaad172d564 in __wt_evict_app_assist_worker_check (session=session@entry=0x31f5fb808cb8, busy=false, readonly=false, interruptible=true, didworkp=0x0) at src/third_party/wiredtiger/src/include/../evict/evict_inline.h:960
      [2026/02/25 16:28:31.086] #6  0x0000aaaacf80c0e0 in __wt_txn_commit (session=session@entry=0x31f5fb808cb8, cfg=cfg@entry=0xffff76a3f200) at ./src/third_party/wiredtiger/src/txn/txn.c:1833
      [2026/02/25 16:28:31.086] #7  0x0000aaaad1703f94 in __session_commit_transaction (wt_session=0x31f5fb808cb8, config=<optimized out>) at ./src/third_party/wiredtiger/src/session/session_api.c:1934
      [2026/02/25 16:28:31.086] #8  0x0000aaaacf72f84c in mongo::WiredTigerSession::commit_transaction<decltype(nullptr)>(decltype(nullptr)&&) (this=0x31f5ff4af500, args=<optimized out>) at src/mongo/db/storage/wiredtiger/wiredtiger_session.h:140
      [2026/02/25 16:28:31.086] #9  mongo::WiredTigerRecoveryUnit::_txnClose (this=this@entry=0x31f5ff63f200, commit=<optimized out>) at ./src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:393
      [2026/02/25 16:28:31.086] #10 0x0000aaaacf7551c8 in mongo::WiredTigerRecoveryUnit::_commit (this=0x31f5ff63f200) at ./src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:155
      [2026/02/25 16:28:31.086] #11 mongo::WiredTigerRecoveryUnit::doCommitUnitOfWork (this=0x31f5ff63f200) at ./src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:228
      [2026/02/25 16:28:31.086] #12 0x0000aaaacf7f4f3c in mongo::RecoveryUnit::commitUnitOfWork (this=0x31f5ff63f200) at src/mongo/db/storage/recovery_unit.cpp:116
      [2026/02/25 16:28:31.086] #13 mongo::WriteUnitOfWork::commit (this=0xffff76a3f580) at ./src/mongo/db/storage/write_unit_of_work.cpp:159
      [2026/02/25 16:28:31.086] #14 0x0000aaaad1d3f7fc in mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction(mongo::OperationContext*, mongo::CollectionPtr const&, boost::container::small_vector<boost::container::flat_set<unsigned char, std::less<unsigned char>, boost::container::small_vector<unsigned char, 4ul, void, void> >, 4ul, void, void> const&) const::$_0::operator()() const (this=this@entry=0xffff76a3f608) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_entry_impl.cpp:451
      [2026/02/25 16:28:31.086] #15 0x0000aaaad1d3eb18 in mongo::WriteConflictRetryAlgorithm::operator()<mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction(mongo::OperationContext*, mongo::CollectionPtr const&, boost::container::small_vector<boost::container::flat_set<unsigned char, std::less<unsigned char>, boost::container::small_vector<unsigned char, 4ul, void, void> >, 4ul, void, void> const&) const::$_0>(mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction(mongo::OperationContext*, mongo::CollectionPtr const&, boost::container::small_vector<boost::container::flat_set<unsigned char, std::less<unsigned char>, boost::container::small_vector<unsigned char, 4ul, void, void> >, 4ul, void, void> const&) const::$_0&&) (this=0xffff76a3f700, f=...) at src/mongo/db/shard_role/lock_manager/exception_util.h:189
      [2026/02/25 16:28:31.086] #16 mongo::writeConflictRetry<mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction(mongo::OperationContext*, mongo::CollectionPtr const&, boost::container::small_vector<boost::container::flat_set<unsigned char, std::less<unsigned char>, boost::container::small_vector<unsigned char, 4ul, void, void> >, 4ul, void, void> const&) const::$_0>(mongo::OperationContext*, mongo::StringData, mongo::NamespaceStringOrUUID const&, mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction(mongo::OperationContext*, mongo::CollectionPtr const&, boost::container::small_vector<boost::container::flat_set<unsigned char, std::less<unsigned char>, boost::container::small_vector<unsigned char, 4ul, void, void> >, 4ul, void, void> const&) const::$_0&&, boost::optional<unsigned long>, int) (opCtx=0x31f5f9ebd900, nssOrUUID=..., f=..., dumpStateRetryCount=0, opStr=..., retryLimit=...) at src/mongo/db/shard_role/lock_manager/exception_util.h:278
      [2026/02/25 16:28:31.086] #17 mongo::IndexCatalogEntryImpl::_setMultikeyInMultiDocumentTransaction (this=this@entry=0x31f5ff574100, opCtx=opCtx@entry=0x31f5f9ebd900, collection=..., multikeyPaths=...) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_entry_impl.cpp:374
      [2026/02/25 16:28:31.086] #18 0x0000aaaad1d3e62c in mongo::IndexCatalogEntryImpl::setMultikey (this=<optimized out>, opCtx=0x31f5f9ebd900, collection=..., multikeyMetadataKeys=..., multikeyPaths=...) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_entry_impl.cpp:301
      [2026/02/25 16:28:31.086] #19 0x0000aaaacf80d588 in mongo::SortedDataIndexAccessMethod::insertKeysAndUpdateMultikeyPaths (this=0x31f5ff583520, opCtx=0x31f5f9ebd900, ru=..., coll=..., entry=0x31f5ff574100, keys=..., multikeyMetadataKeys=..., multikeyPaths=..., options=..., includeDuplicateRecordId=mongo::IncludeDuplicateRecordId::kOff, containerWriteBehavior=mongo::SortedDataIndexAccessMethod::ContainerWriteBehavior::kUnreplicated, onDuplicateKey=..., numInserted=<optimized out>) at ./src/mongo/db/index/index_access_method.cpp:393
      [2026/02/25 16:28:31.086] #20 mongo::SortedDataIndexAccessMethod::_indexKeysOrWriteToSideTable (this=0x31f5ff583520, opCtx=0x31f5f9ebd900, coll=..., entry=0x31f5ff574100, keys=..., multikeyMetadataKeys=..., multikeyPaths=..., obj=..., options=..., keysInsertedOut=0xffff76a40230) at ./src/mongo/db/index/index_access_method.cpp:1589
      [2026/02/25 16:28:31.086] #21 mongo::SortedDataIndexAccessMethod::insert (this=0x31f5ff583520, opCtx=<optimized out>, pooledBuilder=..., coll=..., entry=0x31f5ff574100, bsonRecords=..., options=..., numInserted=0xffff76a40230) at ./src/mongo/db/index/index_access_method.cpp:276
      [2026/02/25 16:28:31.086] #22 0x0000aaaacf7bf4b4 in mongo::IndexCatalogImpl::_indexFilteredRecords (this=this@entry=0x31f5f8939600, opCtx=opCtx@entry=0x31f5f9ebd900, coll=..., index=index@entry=0x31f5ff574100, bsonRecords=std::vector of length 1, capacity 1 = {...}, keysInsertedOut=keysInsertedOut@entry=0xffff76a40230) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_impl.cpp:1783
      [2026/02/25 16:28:31.086] #23 0x0000aaaacf7beffc in mongo::IndexCatalogImpl::_indexRecords (this=this@entry=0x31f5f8939600, opCtx=opCtx@entry=0x31f5f9ebd900, coll=..., index=0x31f5ff574100, bsonRecords=std::vector of length 1, capacity 1 = {...}, keysInsertedOut=keysInsertedOut@entry=0xffff76a40230) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_impl.cpp:1806
      [2026/02/25 16:28:31.086] #24 0x0000aaaacf7beb34 in mongo::IndexCatalogImpl::indexRecords (this=0x31f5f8939600, opCtx=0x31f5f9ebd900, coll=..., bsonRecords=std::vector of length 1, capacity 1 = {...}, keysInsertedOut=0xffff76a40230) at ./src/mongo/db/shard_role/shard_catalog/index_catalog_impl.cpp:1951
      [2026/02/25 16:28:31.086] #25 0x0000aaaacf82de9c in mongo::(anonymous namespace)::insertDocumentsImpl (opCtx=<optimized out>, collection=..., begin=..., end=..., opDebug=0x31f5fa5ebe60, fromMigrate=false) at ./src/mongo/db/collection_crud/collection_write_path.cpp:377
      [2026/02/25 16:28:31.086] #26 mongo::collection_internal::insertDocuments (opCtx=<optimized out>, opCtx@entry=0x31f5f9ebd900, collection=..., begin=..., end=..., opDebug=0x31f5fa5ebe60, fromMigrate=false) at ./src/mongo/db/collection_crud/collection_write_path.cpp:585
      [2026/02/25 16:28:31.086] #27 0x0000aaaad239082c in mongo::write_ops_exec::(anonymous namespace)::insertDocumentsAtomically (opCtx=0x31f5f9ebd900, collection=..., begin=..., end=..., fromMigrate=<optimized out>) at ./src/mongo/db/query/write_ops/write_ops_exec.cpp:387
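
      Since the attached stack_traces.txt is large, a small script may help find every thread in the same state. The sketch below assumes the file uses the same gdb layout as the excerpt above (a "Thread N (Thread 0x... (LWP ...)):" header followed by "#n" frame lines). threads_with_symbol is a hypothetical helper, the sample frames are shortened from the excerpt, and the second thread in the sample is invented for illustration.

```python
import re

# Matches gdb thread headers such as:
#   Thread 7 (Thread 0xffff76a430c0 (LWP 167429)):
THREAD_RE = re.compile(r"Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")


def threads_with_symbol(lines, symbol):
    """Return gdb thread numbers whose stack mentions `symbol`, in file order."""
    matches = []
    current = None  # thread number of the stack currently being read
    for line in lines:
        header = THREAD_RE.search(line)
        if header:
            current = int(header.group(1))
        elif current is not None and symbol in line and current not in matches:
            matches.append(current)
    return matches


# Thread 7 is shortened from the excerpt above; thread 8 is made up.
sample = [
    "[2026/02/25 16:28:31.086] Thread 7 (Thread 0xffff76a430c0 (LWP 167429)):",
    "[2026/02/25 16:28:31.086] #4  __wti_evict_app_assist_worker (...) at ./src/third_party/wiredtiger/src/evict/evict_lru.c:3226",
    "[2026/02/25 16:28:31.086] #6  0x0000aaaacf80c0e0 in __wt_txn_commit (...) at ./src/third_party/wiredtiger/src/txn/txn.c:1833",
    "[2026/02/25 16:28:31.086] Thread 8 (Thread 0xffff00000000 (LWP 1)):",
    "[2026/02/25 16:28:31.086] #0  0x0000000000000000 in poll () from /lib64/libc.so.6",
]

stuck = threads_with_symbol(sample, "__wti_evict_app_assist_worker")
```

      Running the same function over the lines of the attachment, for example with open("stack_traces.txt") as the input, would list the threads that are currently assisting eviction.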

      I have attached logs and core dumps for the failing test (transaction_too_large_for_cache.js when run in the no_passthrough_disagg_override.yml suite).

      Motivation

      Does this affect any team outside of WT? Are they blocked? Are they waiting for an answer?

      • It prevents the cluster scalability team, which is adding test coverage for transactions, from running certain transaction-related tests in disagg.

      How likely is it that this use case or problem will occur? Main path? Edge case? Frequency of the issue?

      • Edge case: it occurs when eviction is necessary due to cache pressure.

      If the problem does occur, what are the consequences and how severe are they? A minor annoyance at a log message? Performance concern? Outage/unavailability? Test Failure?

      • The operation hangs indefinitely, and new user operations may also stall waiting for eviction.

      Is this issue urgent? Does this ticket have a required timeline? What is it?

      • Semi-urgent, as we want to ensure transactions work correctly for public preview.

      Acceptance Criteria (Definition of Done)

      https://github.com/wiredtiger/wiredtiger/wiki/Creating-tickets-that-are-likely-to-be-actioned#acceptance-criteria-definition-of-done
      When will this ticket be considered done? What is the acceptance criteria for this ticket to be closed?

      • Tests in jstests/noPassthrough/txns_cache_errors/*.js no longer hang when run in no_passthrough_disagg_override.

      Testing

      https://github.com/wiredtiger/wiredtiger/wiki/Creating-tickets-that-are-likely-to-be-actioned#testing
      What all testing needs to be done as part of this ticket? Unit? Functional? Performance? Testing at MongoDB side?

      • Testing on the MongoDB side to ensure the tests in jstests/noPassthrough/txns_cache_errors/*.js work when run in no_passthrough_disagg_override.

       
       

        1. logs.txt (502 kB, Wenqin Ye)
        2. stack_traces.txt (613 kB, Wenqin Ye)

            Assignee: [DO NOT USE] Backlog - Storage Engines Team
            Reporter: Wenqin Ye