-
Type:
Task
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Storage Engines, Storage Engines - Foundations
-
None
-
None
Summary
An effective hang in WT was observed in Disagg testing.
The general workflow is a sharded cluster, where analyzeShardKey is run on the collection shard primary (shard1) and then is routed to shard0 and shard1 (itself). The call hangs for 15 minutes and the final stacktrace shows WT in RandomCursor::next() logic.
Note that all workflows happen on the Standby node (due to readPreference "secondaryPreferred"). The tests pass as expected when readPreference is "primary".
Test context is that the test does high volume writes and then batch deletes, so the collection may be empty at the time this hang occurs.
The hang is easily reproduced, so additional logging can easily be added for better understanding of the logic.
Motivation
The Cluster Scalability team is wrapping up initial disagg validation and one of the remaining failing tests involves this workflow. Getting a fix or a workaround for testing will unblock validation of the functionality tested here.
Acceptance Critera
A valid root cause for disagg is determined and a fix/workaround is provided to move forward with disagg testing.
Testing
The hang is easily reproduced.
Branch: jordanglassley-mdb/SERVER-121429
Variant: ! Amazon Linux 2023 arm64 Atlas Enterprise (all feature flags, extra system deps)
Suite: disagg_analyze_shard_key_jscore_passthrough
Tests:
- https://github.com/10gen/mongo/blob/master/jstests/core/query/delete/batched_multi_deletes_a.js
- https://github.com/10gen/mongo/blob/master/jstests/core/query/delete/batched_multi_deletes_id.js
Hook:
Example of failing patch:
https://spruce.corp.mongodb.com/version/69ea592c01ac130007e3cbbe/tasks
Additional investigation detail
Here are the main relevant nodes and their stacktrace at the EVG timeout. The full stacktraces are attached to this ticket.
Coordinator (Collection Primary Shard - Shard 1 - Node 1) [pid 236536]
Last message:
[j0:s1:n1] | 2026-04-23T18:21:32.423+00:00 I QUERY 9981957 [S] [conn228] "CursorEstablisher::_waitForResponse: calling _ars->nextResponse()","attr":{"namespace":"test.batched_multi_deletes_a"}
Stacktrace:
Thread 5 (Thread 0xffff3aaa06c0 (LWP 239424)): ... #32 mongo::sharded_agg_helpers::targetShardsAndAddMergeCursors (expCtx=..., targetRequest=..., shardCursorsSortSpec=..., shardTargetingPolicy=..., readConcern=..., useCollectionDefaultCollator=...) at ./src/mongo/db/pipeline/sharded_agg_helpers.cpp:2004 #33 0x0000aaaac8ff07ac in mongo::ShardServerProcessInterface::preparePipelineForExecution (this=..., expCtx=..., aggRequest=..., pipeline=..., shardCursorsSortSpec=..., shardTargetingPolicy=..., readConcern=..., shouldUseCollectionDefaultCollator=...) at ./src/mongo/db/pipeline/process_interface/shardsvr_process_interface.cpp:750 #34 0x0000aaaac9c93f6c in mongo::pipeline_factory::(anonymous namespace)::finalizePipeline (pipeline=..., expCtx=..., opts=..., aggRequest=..., shardCursorsSortSpec=..., readConcernOverride=...) at ./src/mongo/db/pipeline/pipeline_factory.cpp:101 #35 0x0000aaaac9c945e0 in mongo::pipeline_factory::makePipeline (aggRequest=..., expCtx=..., shardCursorsSortSpec=..., opts=...) at ./src/mongo/db/pipeline/pipeline_factory.cpp:170 #36 0x0000aaaac8f287b0 in mongo::SamplingBasedSplitPolicy::_makePipeline (opCtx=..., ns=..., shardKey=..., numInitialChunks=..., samplesPerChunk=..., opts=...) at ./src/mongo/db/s/config/initial_split_policy.cpp:960 #37 0x0000aaaac8f26404 in mongo::SamplingBasedSplitPolicy::_makePipelineDocumentSource (opCtx=..., ns=..., shardKey=..., numInitialChunks=..., samplesPerChunk=..., opts=...) at ./src/mongo/db/s/config/initial_split_policy.cpp:907 #38 0x0000aaaac8f260a4 in mongo::SamplingBasedSplitPolicy::make (opCtx=..., nss=..., shardKey=..., numInitialChunks=..., zones=..., availableShardIds=..., samplesPerChunk=...) at ./src/mongo/db/s/config/initial_split_policy.cpp:749 #39 0x0000aaaac85718f8 in mongo::analyze_shard_key::(anonymous namespace)::generateSplitPoints (opCtx=..., analyzeShardKeyId=..., nss=..., collUuid=..., shardKey=...) at ./src/mongo/db/s/analyze_shard_key_cmd_util.cpp:1007
Shard 1 - Node 1 [pid 236536]
Last message:
[j0:s1:n1] | 2026-04-23T18:21:32.423+00:00 W STORAGE 9981970 [S] [conn304] "RandomCursor::next still running","attr":{"uri":"table:collection-0cbbf3c9-cf85-410c-9ce9-536560e76421","prepareRetries":1,"elapsedMs":0}
Stacktrace
Thread 4 (Thread 0xffff3cd706c0 (LWP 240204)): #0 __wt_atomic_load_uint64_v_relaxed (vp=...) at src/third_party/wiredtiger/src/include/gcc.h:409 #1 __wt_session_gen_enter (session=..., which=...) at src/third_party/wiredtiger/src/support/generation.c:340 #2 __wt_btcur_next_random (cbt=...) at ./src/third_party/wiredtiger/src/btree/bt_random.c:557 #3 0x0000aaaac7e0e9a4 in __wti_curfile_next_random (cursor=...) at ./src/third_party/wiredtiger/src/cursor/cur_file.c:215 #4 0x0000aaaac7e2db38 in __clayered_constituent_iter (constituent=..., forward=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:917 #5 __clayered_iterate_constituents (clayered=..., iter_flag=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:978 #6 __clayered_iterate (clayered=..., iter_flag=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:1093 #7 0x0000aaaac7e2f16c in __clayered_search_near_int (session=..., cursor=..., exactp=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:1771 #8 0x0000aaaac7e2d02c in __clayered_next_random (cursor=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:2503 #9 0x0000aaaac5946a64 in mongo::WiredTigerRecordStore::RandomCursor::next()::{lambda()#1}::operator()() const (this=...) at ./src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp:548 #10 0x0000aaaac59462cc in mongo::wiredTigerPrepareConflictRetry<mongo::WiredTigerRecordStore::RandomCursor::next()::{lambda()#1}>(mongo::Interruptible&, mongo::PrepareConflictTracker&, mongo::RecoveryUnit&, mongo::WiredTigerRecordStore::RandomCursor::next()::{lambda()#1}&&) (interruptible=..., f=..., tracker=..., ru=...) at src/mongo/db/storage/wiredtiger/wiredtiger_prepare_conflict.h:75
Shard 0 - Node 1 [pid 236840]
Last message:
[j0:s1:n1] | 2026-04-23T18:21:32.423+00:00 W STORAGE 9981970 [S] [conn304] "RandomCursor::next still running","attr":{"uri":"table:collection-0cbbf3c9-cf85-410c-9ce9-536560e76421","prepareRetries":1,"elapsedMs":0}
Stacktrace:
Thread 6 (Thread 0xffff4327f6c0 (LWP 240441)): #0 0x0000aaaab3607520 in __wt_epoch_raw (session=..., tsp=...) at ./src/third_party/wiredtiger/src/os_posix/os_time.c:18 #1 0x0000aaaab63f5f3c in __wt_epoch (session=..., tsp=...) at src/third_party/wiredtiger/src/include/time_inline.h:81 #2 __wt_btcur_next_random (cbt=...) at ./src/third_party/wiredtiger/src/btree/bt_random.c:526 #3 0x0000aaaab646e9a4 in __wti_curfile_next_random (cursor=...) at ./src/third_party/wiredtiger/src/cursor/cur_file.c:215 #4 0x0000aaaab648db38 in __clayered_constituent_iter (constituent=..., forward=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:917 #5 __clayered_iterate_constituents (clayered=..., iter_flag=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:978 #6 __clayered_iterate (clayered=..., iter_flag=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:1093 #7 0x0000aaaab648f16c in __clayered_search_near_int (session=..., cursor=..., exactp=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:1771 #8 0x0000aaaab648d02c in __clayered_next_random (cursor=...) at ./src/third_party/wiredtiger/src/cursor/cur_layered.c:2503 #9 0x0000aaaab3fa6a64 in mongo::WiredTigerRecordStore::RandomCursor::next()::{lambda()#1}::operator()() const (this=...) at ./src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp:548
- is related to
-
WT-15189 test/format disagg follower and Python tests time out in clayered_next_random
-
- Open
-