Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0, 8.2.7
Affects Version/s: None
Component/s: None
Labels:
- bf-friday

Assigned Teams:

Replication
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v8.2, v8.0, v7.0
Sprint:
Repl 2026-03-02
Linked BF Score:
200
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When a replica set member receives a new configuration that reduces the election timeout, the node may not run for election at the new (sooner) time. Instead, it continues to use the old (longer) timeout, causing significant delays in failover scenarios.

Location: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/replication_coordinator_impl.cpp#L4670

PostMemberStateUpdateAction _setCurrentRSConfig(...) {
    // ... config installation ...
    
    _cancelCatchupTakeover(lk);
    _cancelPriorityTakeover(lk);
    _cancelAndRescheduleElectionTimeout(lk);  // Called here
    
    // ...
}

Location: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L1262-L1277

void ReplicationCoordinatorImpl::_cancelAndRescheduleElectionTimeout(WithLock lk) {
    // ...
    
    if (wasActive && doNotReschedule) {
        // Only explicitly cancel if NOT rescheduling
        _handleElectionTimeoutCallback.cancel();
    }
    
    if (doNotReschedule)
        return;
    
    // Calculate new timeout from NOW
    auto requestedWhen = now + _rsConfig.unsafePeek().getElectionTimeoutPeriod();
    
    // This does NOT cancel the old callback!
    _handleElectionTimeoutCallback.delayUntilWithJitter(lk, requestedWhen, upperBound);
}

The Bug

Location: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/delayable_timeout_callback.cpp#L93-L108

Status DelayableTimeoutCallback::_delayUntil(WithLock lk, Date_t when) {
    if (!_cbHandle) {
        // No timeout active - schedule new one
        return _reschedule(lk, when);
    }
    if (when == _nextCall) {
        // Same time - do nothing
    }
    _nextCall = when;  // ⚠️ Just updates the target time
    return Status::OK(); // ⚠️ Does NOT cancel/reschedule the callback!
}

The old callback remains scheduled in the executor at its original time. When it eventually fires, it checks:

Location: https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/delayable_timeout_callback.cpp#L134-L145

void DelayableTimeoutCallback::_handleTimeout(...) {
    // ...
    if (_nextCall > now) {
        // Too early - reschedule for the new time
        _reschedule(lk, _nextCall);
        return;
    }
    // It's time (or past time) - execute callback
    _callback(args);
}

Why This is a Problem

Moving timeout LATER: Works fine - old callback fires early, sees now < _nextCall, reschedules to new time

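Moving timeout SOONER: Broken - the old callback stays scheduled at its original (later) time. When it finally fires, it sees now >= _nextCall and runs the callback immediately, but by then the new, sooner deadline has already passed.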
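The asymmetry can be reproduced with a small toy model. This is only a sketch: `ToyDelayableTimeout`, `scheduledAt` and `fireTimeAfterReconfig` are hypothetical names, times are plain milliseconds on a fake clock, jitter is ignored, and the reconfig is assumed to arrive before the old timer fires. It is not the server's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy model of the behaviour described in this ticket (names are illustrative).
struct ToyDelayableTimeout {
    std::optional<int64_t> scheduledAt;  // pending executor wake-up (stands in for _cbHandle)
    int64_t nextCall = 0;                // logical deadline (stands in for _nextCall)

    // Mirrors _delayUntil as quoted above: schedule only if nothing is pending,
    // otherwise just move the logical deadline.
    void delayUntil(int64_t when) {
        if (!scheduledAt) {
            scheduledAt = when;
        }
        nextCall = when;
    }

    // Mirrors _handleTimeout: on each wake-up, reschedule if it is too early,
    // otherwise run the callback. Returns the time at which the callback runs.
    int64_t runUntilFired() {
        assert(scheduledAt.has_value());
        while (true) {
            int64_t now = *scheduledAt;  // the executor wakes us at this time
            if (nextCall > now) {
                scheduledAt = nextCall;  // too early: reschedule for the new time
                continue;
            }
            scheduledAt.reset();
            return now;  // callback executes here
        }
    }
};

// Timer armed at t=0 with oldTimeoutMs; a reconfig at reconfigAtMs asks for
// reconfigAtMs + newTimeoutMs. Returns when the election callback actually runs.
int64_t fireTimeAfterReconfig(int64_t oldTimeoutMs, int64_t reconfigAtMs, int64_t newTimeoutMs) {
    ToyDelayableTimeout t;
    t.delayUntil(oldTimeoutMs);
    t.delayUntil(reconfigAtMs + newTimeoutMs);
    return t.runUntilFired();
}
```

In this model, shrinking the timeout from 10000 ms to 2000 ms at t=1000 still fires at t=10000 rather than t=3000, while growing it from 2000 ms to 10000 ms at t=1000 correctly fires at t=11000.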

Potential Fixes

Option 1: Use scheduleAt() logic in _cancelAndRescheduleElectionTimeout()

Modify _cancelAndRescheduleElectionTimeout() to detect when moving backwards and explicitly cancel:

https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp#L1239-L1295
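One possible shape for Option 1, sketched on a toy model rather than the real classes. `ToyTimeout`, `moveDeadline`, `Outcome` and `simulateOption1` are hypothetical names introduced for illustration, times are milliseconds on a fake clock, and jitter is ignored. The idea is to cancel only when the requested deadline is earlier than the pending wake-up, and to keep the cheap "just move the deadline" path otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy stand-in for the delayable timeout; all names here are hypothetical.
struct ToyTimeout {
    std::optional<int64_t> wakeAt;  // pending executor wake-up, if any
    int64_t nextCall = 0;           // logical deadline
    int cancels = 0;                // number of explicit cancels issued

    // Option 1 idea: cancel only when the deadline moves backwards.
    void moveDeadline(int64_t when) {
        if (wakeAt && when < *wakeAt) {
            wakeAt.reset();  // explicit cancel of the stale wake-up
            ++cancels;
        }
        if (!wakeAt) {
            wakeAt = when;  // nothing pending: schedule fresh
        }
        nextCall = when;
    }

    // Returns the time at which the callback finally runs.
    int64_t runUntilFired() {
        assert(wakeAt.has_value());
        while (nextCall > *wakeAt) {
            wakeAt = nextCall;  // woke too early: reschedule
        }
        int64_t firedAt = *wakeAt;
        wakeAt.reset();
        return firedAt;
    }
};

struct Outcome {
    int64_t firedAt;
    int cancels;
};

// Arms the timer at t=0, applies one reconfig at reconfigAtMs, reports the result.
Outcome simulateOption1(int64_t oldTimeoutMs, int64_t reconfigAtMs, int64_t newTimeoutMs) {
    ToyTimeout t;
    t.moveDeadline(oldTimeoutMs);
    t.moveDeadline(reconfigAtMs + newTimeoutMs);
    return {t.runUntilFired(), t.cancels};
}
```

In this sketch a reduction from 10000 ms to 2000 ms at t=1000 fires at t=3000 at the cost of one cancel, and an increase from 2000 ms to 10000 ms at t=1000 fires at t=11000 with no cancel at all.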

Option 3: Always cancel and reschedule during reconfig

Simply always cancel the existing callback when installing a new config:

void ReplicationCoordinatorImpl::_cancelAndRescheduleElectionTimeout(WithLock lk) {
    // ...
    if (wasActive) {
        _handleElectionTimeoutCallback.cancel();
    }
    
    if (doNotReschedule)
        return;
    
    // Now schedule fresh
    auto requestedWhen = now + _rsConfig.unsafePeek().getElectionTimeoutPeriod();
    Milliseconds upperBound = Milliseconds(_getElectionOffsetUpperBound(lk));
    _handleElectionTimeoutCallback.delayUntilWithJitter(lk, requestedWhen, upperBound);
}
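To see what "always cancel" costs and buys, here is a toy replay of a series of reschedule requests. It is a sketch under assumptions: `Reschedule`, `Outcome` and `simulateAlwaysCancel` are hypothetical names, times are milliseconds on a fake clock, and jitter is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct Reschedule {
    int64_t atMs;       // when the reschedule request arrives
    int64_t timeoutMs;  // election timeout in effect for that request
};

struct Outcome {
    int64_t firedAt;  // when the callback runs (-1 if never armed)
    int cancels;      // number of explicit cancels issued
};

// Replays requests using "cancel if active, then schedule fresh".
Outcome simulateAlwaysCancel(const std::vector<Reschedule>& calls) {
    std::optional<int64_t> wakeAt;
    int cancels = 0;
    for (const Reschedule& c : calls) {
        assert(c.timeoutMs > 0);
        if (wakeAt && *wakeAt <= c.atMs) {
            break;  // the timer already fired before this request arrived
        }
        if (wakeAt) {
            wakeAt.reset();  // wasActive: cancel unconditionally
            ++cancels;
        }
        wakeAt = c.atMs + c.timeoutMs;  // schedule fresh from "now"
    }
    return {wakeAt.value_or(-1), cancels};
}
```

With this policy the deadline always tracks the latest request, so a reduction from 10000 ms to 2000 ms at t=1000 fires at t=3000. The trade-off visible in the model is one cancel per request after the first; whether that overhead matters depends on how often _cancelAndRescheduleElectionTimeout() is called outside of reconfig.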

Attachments

repro.js
Jan 29 2026 07:13:56 PM UTC
5 kB
Moustafa Maher

Assignee: Moustafa Maher
Reporter: Moustafa Maher
Participants: Githook User, Moustafa Maher
Votes: 0
Watchers: 5

Created: Jan 29 2026 07:06:50 PM UTC
Updated: Mar 09 2026 03:28:14 PM UTC
Resolved: Feb 24 2026 06:13:03 PM UTC
