[SERVER-50391] Index build hung during startup recovery waiting for optime to be majority committed Created: 19/Aug/20  Updated: 29/Oct/23  Resolved: 20/Aug/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Samyukta Lanka Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-50525 Resumable index build hangs waiting f... Closed
is related to SERVER-48419 Extend rollback to recover resumable ... Closed
is related to SERVER-48476 resumable index build should use majo... Closed
is related to SERVER-49847 Turn on resumable index build feature... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2020-08-24
Participants:

 Description   

When trying to turn on the resumable index build behavior by default, we found that it's possible for a node to be hung during startup recovery after an unclean shutdown while applying a commitIndexBuild oplog entry. The node is stuck waiting for the last optime before it established the interceptor to be majority committed even though other nodes in the replica set have applied entries from a later term than the optime the node is waiting on.



 Comments   
Comment by Githook User [ 20/Aug/20 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-50391 do not wait for majority commit point while applying startIndexBuild during recovery
Branch: master
https://github.com/mongodb/mongo/commit/504dee509b57ba039bcfe1130054aabc13839fa9

Comment by Benety Goh [ 20/Aug/20 ]

The majority commit wait was added in SERVER-48476 to support the work in SERVER-48419.

Comment by Samyukta Lanka [ 19/Aug/20 ]

Here are the stack traces showing that the node is hung waiting on the majority commit point to advance:

Thread 48 (Thread 0x7f089f343700 (LWP 70583)):
#0  0x00007f08ce857403 in poll () from /lib64/libc.so.6
#1  0x00007f08ce1a2a07 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x7f08bcb433b0, clkSource=0x7f08c35eaca0) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/bits/stl_vector.h:805
#2  0x00007f08c94b24bb in mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
#3  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=..., this=0x7f08c3221590) at src/mongo/stdx/condition_variable.h:162
#4  mongo::Waitable::wait<mongo::BasicLockableAdapter> (lk=..., cv=..., clkSource=<optimized out>, waitable=0x7f08bcb433b0) at src/mongo/util/waitable.h:61
#5  mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/db/operation_context.cpp:325
#6  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x7f08bcb431e0, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:330
#7  0x00007f08cdc6b8f5 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator()(mongo::Date_t, mongo::Interruptible::WakeSpeed) const (this=0x7f08bcb431e0, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
#8  0x00007f08cdc8a194 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl.cpp:1605
#9  mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Mutex>, mongo::repl::ReplicationCoordinatorImpl::waitUntilMajorityOpTime(mongo::OperationContext*, mongo::repl::OpTime, boost::optional<mongo::Date_t>)::<lambda()> > (waitTimer=0x0, pred=..., finalDeadline=..., m=..., cv=..., this=0x7f08bcb431e0) at src/mongo/util/interruptible.h:448
#10 mongo::repl::ReplicationCoordinatorImpl::waitUntilMajorityOpTime (this=0x7f08c3221020, opCtx=0x7f08bcb431e0, targetOpTime=..., deadline=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:1601
#11 0x00007f08cb4e434b in mongo::IndexBuildsCoordinator::_awaitLastOpTimeBeforeInterceptorsMajorityCommitted (this=<optimized out>, opCtx=0x7f08bcb431e0, replState=std::shared_ptr<mongo::ReplIndexBuildState> (use count 9, weak count 0) = {...}) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/bits/shared_ptr_base.h:996

And that the node is trying to apply the commitIndexBuild oplog entry:

Thread 38 (Thread 0x7f08a574d700 (LWP 70572)):
#0  0x00007f08ce857403 in poll () from /lib64/libc.so.6
#1  0x00007f08ce1a2a07 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x7f08bd3533b0, clkSource=0x7f08c35eaca0) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/bits/stl_vector.h:805
#2  0x00007f08c94b24bb in mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57
#3  mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=..., this=0x7f08bbe74f90) at src/mongo/stdx/condition_variable.h:162
#4  mongo::Waitable::wait<mongo::BasicLockableAdapter> (lk=..., cv=..., clkSource=<optimized out>, waitable=0x7f08bd3533b0) at src/mongo/util/waitable.h:61
#5  mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/db/operation_context.cpp:325
#6  mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x7f08c34ad720, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:330
#7  0x00007f08ce174925 in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}, mongo::AtomicWord<long>*)::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#3}::operator()(std::unique_lock<mongo::latch_detail::Latch>&, mongo::AtomicWord<long>*) const (this=0x7f08c34ad720, deadline=..., deadline@entry=..., speed=speed@entry=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50
#8  0x00007f08ce174df1 in mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}, mongo::AtomicWord<long>*)::{lambda(auto:1&, mongo::Interruptible::WakeSpeed)#4}::operator()(std::unique_lock<mongo::latch_detail::Latch>&, mongo::AtomicWord<long>*) const (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/atomic:246
#9  mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::Date_t, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}, mongo::AtomicWord<long>*) (waitTimer=0x0, pred=..., finalDeadline=..., m=..., cv=..., this=0x7f08c34ad720) at src/mongo/util/interruptible.h:448
#10 mongo::Interruptible::waitForConditionOrInterrupt<std::unique_lock<mongo::latch_detail::Latch>, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}>(mongo::stdx::condition_variable&, std::unique_lock<mongo::latch_detail::Latch>&, mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)::{lambda()#1}, mongo::AtomicWord<long>*) (waitTimer=0x0, pred=..., m=..., cv=..., this=0x7f08c34ad720) at src/mongo/util/interruptible.h:466
#11 mongo::future_details::SharedStateBase::wait (this=<optimized out>, interruptible=0x7f08c34ad720) at src/mongo/util/future_impl.h:402
#12 0x00007f08cb4e27d1 in mongo::future_details::SharedStateHolder<mongo::ReplIndexBuildState::IndexCatalogStats>::waitNoThrow (interruptible=0x7f08c34ad720, this=<synthetic pointer>) at src/third_party/boost-1.70.0/boost/smart_ptr/intrusive_ptr.hpp:197
#13 mongo::SharedSemiFuture<mongo::ReplIndexBuildState::IndexCatalogStats>::waitNoThrow (interruptible=0x7f08c34ad720, this=<synthetic pointer>) at src/mongo/util/future.h:938
#14 mongo::IndexBuildsCoordinator::applyCommitIndexBuild (this=this@entry=0x7f08c30a3620, opCtx=opCtx@entry=0x7f08c34ad720, oplogEntry=...) at src/mongo/db/index_builds_coordinator.cpp:892
#15 0x00007f08cb573962 in mongo::repl::(anonymous namespace)::<lambda(mongo::OperationContext*, const mongo::repl::OplogEntry&, mongo::repl::OplogApplication::Mode)>::operator() (__closure=<optimized out>, mode=<optimized out>, entry=..., opCtx=0x7f08c34ad720) at src/third_party/boost-1.70.0/boost/optional/detail/optional_aligned_storage.hpp:64
#16 std::_Function_handler<mongo::Status(mongo::OperationContext*, const mongo::repl::OplogEntry&, mongo::repl::OplogApplication::Mode), mongo::repl::(anonymous namespace)::<lambda(mongo::OperationContext*, const mongo::repl::OplogEntry&, mongo::repl::OplogApplication::Mode)> >::_M_invoke(const std::_Any_data &, mongo::OperationContext *&&, const mongo::repl::OplogEntry &, mongo::repl::OplogApplication::Mode &&) (__functor=..., __args#0=<optimized out>, __args#1=..., __args#2=<optimized out>) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/bits/std_function.h:283
#17 0x00007f08cb57bdd4 in std::function<mongo::Status (mongo::OperationContext*, mongo::repl::OplogEntry const&, mongo::repl::OplogApplication::Mode)>::operator()(mongo::OperationContext*, mongo::repl::OplogEntry const&, mongo::repl::OplogApplication::Mode) const (__args#2=<optimized out>, __args#1=..., __args#0=<optimized out>, this=0x7f08c3595a80) at /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.Fno/include/c++/8.2.0/bits/std_function.h:682
#18 mongo::repl::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/db/repl/oplog.cpp:1557
#19 mongo::repl::applyCommand_inlock (opCtx=0x7f08c34ad720, entry=..., mode=mongo::repl::OplogApplication::Mode::kRecovering) at src/mongo/db/repl/oplog.cpp:1561

Comment by Samyukta Lanka [ 19/Aug/20 ]

The other two nodes in the set have applied something from a different term than the optime that the node is waiting on:

[ReplicaSetFixture:job0:node1] 2020-08-19T03:03:45.326+0000 | 2020-08-19T03:03:45.326+00:00 I  STORAGE  4847600 [IndexBuildsCoordinatorMongod-0] "Index build: waiting for last optime before interceptors to be majority committed","attr":{"buildUUID":{"uuid":{"$uuid":"ef1c6532-9420-4332-80d5-8cc8537a4900"}},"lastOpTime":{"ts":{"$timestamp":{"t":1597806220,"i":23}},"t":2}}
...
[ReplicaSetFixture:job0:node0] 2020-08-19T03:03:45.634+0000 | 2020-08-19T03:03:45.633+00:00 D2 REPL_HB  24097   [conn14] "Generated heartbeat response","attr":{"from":"localhost:20002","response":{"ok":1,"electionTime":{"$date":{"$numberLong":"6862525460245381144"}},"state":1,"v":5,"configTerm":3,"set":"rs","term":3,"primaryId":0,"durableOpTime":{"ts":{"$timestamp":{"t":1597806221,"i":2}},"t":3},"durableWallTime":{"$date":"2020-08-19T03:03:41.643Z"},"opTime":{"ts":{"$timestamp":{"t":1597806221,"i":2}},"t":3},"wallTime":{"$date":"2020-08-19T03:03:41.643Z"}}}
...
[ReplicaSetFixture:job0:node2] 2020-08-19T03:03:46.619+0000 | 2020-08-19T03:03:46.618+00:00 D2 REPL_HB  24097   [conn79] "Generated heartbeat response","attr":{"from":"localhost:20000","response":{"ok":1,"state":2,"v":5,"configTerm":3,"set":"rs","syncingTo":"localhost:20000","term":3,"primaryId":0,"durableOpTime":{"ts":{"$timestamp":{"t":1597806221,"i":2}},"t":3},"durableWallTime":{"$date":"2020-08-19T03:03:41.643Z"},"opTime":{"ts":{"$timestamp":{"t":1597806221,"i":2}},"t":3},"wallTime":{"$date":"2020-08-19T03:03:41.643Z"}}}

Generated at Thu Feb 08 05:22:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.