[SERVER-65905] Shard split test times out non-deterministically Created: 22/Apr/22  Updated: 12/May/22  Resolved: 12/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Didier Nadeau Assignee: Didier Nadeau
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by SERVER-66349 Recipient monitors are not shut down ... Closed
Operating System: ALL
Participants:

 Description   

On a patch build for a change unrelated to shard split, on top of master, the shard_split_write_during_split_stepdown.js test timed out.

This test does the following (a sketch follows the list):

  • Set the pauseShardSplitAfterBlocking failpoint
  • Start a shard split, wait for the failpoint
  • Start async write operations
  • Step down the primary
  • Wait for shard split to return InterruptedDueToReplStateChange
  • Check the write thread returns TenantMigrationCommitted
  • Stop the test
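
A minimal sketch of that flow in mongo shell JavaScript is below. It is not the verbatim test: the fixture setup is simplified (the real test presumably uses a shard split fixture with recipient nodes tagged to match the recipient tag), and the commitShardSplit field names — migrationId, tenantIds, recipientTagName, recipientSetName — are assumptions about the command shape at the time of this ticket, not confirmed signatures.

    load("jstests/libs/fail_point_util.js");   // configureFailPoint()
    load("jstests/libs/parallelTester.js");    // Thread

    // Simplified fixture; the real test would configure recipient nodes.
    const rst = new ReplSetTest({nodes: 3});
    rst.startSet();
    rst.initiate();
    const primary = rst.getPrimary();

    // 1. Set the pauseShardSplitAfterBlocking failpoint.
    const blockingFp = configureFailPoint(primary, "pauseShardSplitAfterBlocking");

    // 2. Start a shard split asynchronously and wait for the failpoint.
    const split = new Thread(function(host) {
        const donor = new Mongo(host);
        return donor.adminCommand({
            commitShardSplit: 1,
            migrationId: UUID(),              // assumed command shape
            tenantIds: ["testTenant"],
            recipientTagName: "recipient",
            recipientSetName: "recipientSet",
        });
    }, primary.host);
    split.start();
    blockingFp.wait();

    // 3. Start async write operations that will block behind the split.
    const writes = startParallelShell(function() {
        const res = db.getSiblingDB("testTenant_db").coll.insert({x: 1});
        // 6. The write thread is expected to fail with
        //    TenantMigrationCommitted once a new primary finishes the split.
        assert.writeErrorWithCode(res, ErrorCodes.TenantMigrationCommitted);
    }, primary.port);

    // 4. Step down the primary while the split is still blocking.
    assert.commandWorked(primary.adminCommand({replSetStepDown: 60, force: true}));

    // 5. The interrupted split should return InterruptedDueToReplStateChange.
    split.join();
    assert.commandFailedWithCode(split.returnData(),
                                 ErrorCodes.InterruptedDueToReplStateChange);

    // 7. Stop the test.
    writes();          // join the write thread
    blockingFp.off();
    rst.stopSet();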

In this run, the shard split timed out at the same moment as the stepdown. The blocking failpoint can be "unblocked" by either the timeout or the stepdown, since the wait is interruptible. After the initial primary (d20520) stepped down, the split was resumed by another node (d20522). The split ran to completion on d20522 and then waited for the forget command. The test framework did not send this command and simply shut the node down (as expected), but d20522 never finished shutting down and the test timed out.
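
For context, a completed split instance remains alive on the donor until it is told to forget the operation. A hedged sketch of that cleanup step, which the framework intentionally skipped here, is below; the forgetShardSplit command shape and the newPrimary/migrationId names are assumptions for illustration.

    // Hypothetical cleanup: tell the node that finished the split (d20522
    // here) to forget it, so the instance can be garbage collected and the
    // node can shut down cleanly.
    assert.commandWorked(newPrimary.adminCommand({
        forgetShardSplit: 1,
        migrationId: migrationId,  // the UUID the split was started with
    }));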



 Comments   
Comment by Matt Broadstone [ 12/May/22 ]

Closing this since we think SERVER-66349 will fix the issue.

Comment by Didier Nadeau [ 22/Apr/22 ]

Important points from the investigation:

  • Shard split restarted after stepup (link)
  • Shard split completed and was waiting for forget (link)
  • PrimaryOnlyService stuck waiting on a thread join (log); stack trace:

    Thread 2: "" (Thread 0x7f09eed4c700 (LWP 27086))
    #0  0x00007f0a0017f48c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x00007f0a01eafeac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /data/debug/lib/libbase.so
    #2  0x00007f0a01e86a10 in std::_V2::condition_variable_any::wait<std::unique_lock<mongo::latch_detail::Latch> > (this=0x5651255862f0, __lock=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/std_mutex.h:238
    #3  0x00007f09fbb4f38b in std::_V2::condition_variable_any::wait<std::unique_lock<mongo::latch_detail::Latch>, mongo::executor::ThreadPoolTaskExecutor::_join(std::unique_lock<mongo::latch_detail::Latch>)::<lambda()> > (__p=..., __lock=..., this=0x5651255862f0) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/condition_variable:269
    #4  mongo::executor::ThreadPoolTaskExecutor::_join (this=0x565125586210, lk=...) at src/mongo/executor/thread_pool_task_executor.cpp:188
    #5  0x00007f09fbb4f72e in mongo::executor::ThreadPoolTaskExecutor::join (this=0x565125586210) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/move.h:47
    #6  0x00007f09fe2200d0 in mongo::repl::PrimaryOnlyService::shutdown (this=0x56512095bf00) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/shared_ptr_base.h:1018
    #7  0x00007f09fe220422 in mongo::repl::PrimaryOnlyServiceRegistry::onShutdown (this=<optimized out>) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/unique_ptr.h:345
    #8  0x00007f09fe1f2225 in mongo::ReplicaSetAwareServiceRegistry::<lambda(mongo::ReplicaSetAwareInterface*)>::operator() (__closure=<optimized out>, service=<optimized out>) at src/mongo/db/repl/replica_set_aware_service.cpp:73
    #9  std::for_each<__gnu_cxx::__normal_iterator<mongo::ReplicaSetAwareInterface**, std::vector<mongo::ReplicaSetAwareInterface*> >, mongo::ReplicaSetAwareServiceRegistry::onShutdown()::<lambda(mongo::ReplicaSetAwareInterface*)> > (__last=0x7a69726f68747541, __first=0x5651208f0de0, __f=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/stl_algo.h:3882
    #10 mongo::ReplicaSetAwareServiceRegistry::onShutdown (this=<optimized out>) at src/mongo/db/repl/replica_set_aware_service.cpp:74
    #11 0x00007f09fdfe0ea3 in mongo::repl::ReplicationCoordinatorImpl::shutdown (this=0x5651205be000, opCtx=0x56512557f600) at src/mongo/db/repl/replication_coordinator_impl.cpp:1026
    #12 0x00007f0a02325cd0 in mongo::(anonymous namespace)::shutdownTask (shutdownArgs=...) at src/mongo/db/mongod_main.cpp:1288
    #13 0x00007f0a01e85a75 in mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>::operator()(mongo::ShutdownTaskArgs const&) const (args#0=..., this=<optimized out>) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/stl_deque.h:255
    #14 mongo::(anonymous namespace)::runTasks(std::stack<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)>, std::deque<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)>, std::allocator<mongo::unique_function<void(const mongo::ShutdownTaskArgs&)> > > >, const mongo::ShutdownTaskArgs &) (tasks=std::stack wrapping: std::deque with 1 element = {...}, shutdownArgs=...) at src/mongo/util/exit.cpp:61
    #15 0x00007f0a01dccb7c in mongo::shutdown (code=code@entry=mongo::EXIT_CLEAN, shutdownArgs=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/move.h:182
    #16 0x00007f09ff63c855 in mongo::exitCleanly (code=mongo::EXIT_CLEAN) at src/third_party/boost/boost/optional/optional.hpp:872
    #17 mongo::(anonymous namespace)::handleOneSignal (rotation=<synthetic pointer>, rotation=<synthetic pointer>, waited=...) at src/mongo/util/signal_handlers.cpp:264
    #18 mongo::(anonymous namespace)::signalProcessingThread (rotate=mongo::LogFileStatus::kNeedToRotateLogFile) at src/mongo/util/signal_handlers.cpp:305
    #19 0x00007f09ff63ca35 in std::__invoke_impl<void, void (*)(mongo::LogFileStatus), mongo::LogFileStatus> (__f=@0x5651209bbc10: 0x7f09ff63b990 <mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus)>) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/invoke.h:89
    #20 std::__invoke<void (*)(mongo::LogFileStatus), mongo::LogFileStatus> (__fn=@0x5651209bbc10: 0x7f09ff63b990 <mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus)>) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/invoke.h:95
    #21 std::__apply_impl<void (*)(mongo::LogFileStatus), std::tuple<mongo::LogFileStatus>, 0ul> (__f=@0x5651209bbc10: 0x7f09ff63b990 <mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus)>, __t=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/tuple:1678
    #22 std::apply<void (*)(mongo::LogFileStatus), std::tuple<mongo::LogFileStatus> > (__f=@0x5651209bbc10: 0x7f09ff63b990 <mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus)>, __t=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/tuple:1687
    #23 mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}::operator()() (this=0x5651209bbc08) at src/mongo/stdx/thread.h:193
    #24 std::__invoke_impl<void, mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}>(std::__invoke_other, mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}&&) (__f=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/invoke.h:60
    #25 std::__invoke<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}>(mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}&&) (__fn=...) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/bits/invoke.h:95
    #26 std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x5651209bbc08) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/thread:244
    #27 std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}> >::operator()() (this=0x5651209bbc08) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/thread:253
    #28 std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}> > >::_M_run() (this=0x5651209bbc00) at /opt/mongodbtoolchain/revisions/c6da1cf7f0b4b60d53566305e59857d3d540dcf7/stow/gcc-v3.IGP/include/c++/8.5.0/thread:196
    #29 0x00007f09f8f076af in execute_native_thread_routine () from /data/debug/lib/libbackground_job.so
    #30 0x00007f0a001792de in start_thread () from /lib64/libpthread.so.0
    #31 0x00007f09ffea9a63 in clone () from /lib64/libc.so.6 

    • Unfortunately, because symbols were optimized out, we can't get more information on frame 6. We don't know whether `shutdown` hangs while joining the scoped executor or the cleanup executor.
  • Primary ser