Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: ClusterScalability Jul21-Aug3, ClusterScalability Aug4-18
Case:
Linked BF Score: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The resharding oplog fetchers are started in _ensureDataReplicationStarted, which is called only in the cloning and applying phases. SPM-2322 added a building-index phase between cloning and applying, but that phase does not ensure data replication has started, because building indexes does not itself depend on data replication. As a result, if a failover occurs while indexes are being built, no oplog fetcher runs at all until the transition to applying, which can be a long time if the index builds are slow. When the fetcher is finally started as part of that transition, it must resume scanning the oplog from a point far in the past, which can cause the fetcher to stall.
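The gap described above can be illustrated with a toy model. This is a minimal sketch, not MongoDB code: the class `ToyResharding`, its method names, and the idea of passing in the set of phases that start replication are all hypothetical, and the second configuration shows only one plausible remedy, not necessarily the change that was merged.

```python
from enum import Enum


class Phase(Enum):
    CLONING = "cloning"
    BUILDING_INDEX = "building-index"
    APPLYING = "applying"


class ToyResharding:
    """Toy stand-in for the resharding state machine (hypothetical)."""

    def __init__(self, phases_that_start_replication):
        # Phases whose handler ensures data replication has started.
        self.phases_that_start_replication = set(phases_that_start_replication)
        self.fetcher_running = False

    def ensure_data_replication_started(self):
        # Idempotent: calling this while the fetcher runs changes nothing.
        self.fetcher_running = True

    def resume_after_failover(self, phase):
        # A failover discards in-memory state, so no fetcher is running
        # until some phase handler starts it again.
        self.fetcher_running = False
        if phase in self.phases_that_start_replication:
            self.ensure_data_replication_started()
        return self.fetcher_running


# Behaviour described in the ticket: only cloning and applying start fetchers.
old = ToyResharding({Phase.CLONING, Phase.APPLYING})
# Assumed remedy: the building-index phase also starts them.
new = ToyResharding({Phase.CLONING, Phase.BUILDING_INDEX, Phase.APPLYING})

print(old.resume_after_failover(Phase.BUILDING_INDEX))  # False
print(new.resume_after_failover(Phase.BUILDING_INDEX))  # True
```

In the first configuration, a failover during the building-index phase leaves the fetcher stopped until the phase changes, which is the window the description is concerned with.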

In the worst case, these stalls can be severe enough that the legacy time estimation algorithm, which compares fetched and applied oplog entries, concludes that we are caught up, which in turn leads to a critical section timeout. An example of this can be seen in HELP-78622.
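To see why a stalled fetcher can make such an estimate look caught up, consider the following toy calculation. The ticket says only that the legacy algorithm compares fetched and applied oplogs; the formula, the function name `toy_remaining_seconds`, and all numbers below are assumptions made for illustration.

```python
def toy_remaining_seconds(fetched: int, applied: int, secs_per_entry: float) -> float:
    """Hypothetical estimate: time needed to apply entries fetched but not yet applied."""
    return max(fetched - applied, 0) * secs_per_entry


# Assumed scenario: many entries exist upstream, but the fetcher has stalled
# after fetching only a few, and all of those have already been applied.
upstream_entries = 1_000_000
fetched = 100
applied = 100

estimate = toy_remaining_seconds(fetched, applied, secs_per_entry=0.001)
true_backlog = upstream_entries - applied

print(estimate)      # 0.0
print(true_backlog)  # 999900
```

Because the estimate only sees what has been fetched, it reports no remaining work while almost the entire backlog is still unfetched. Entering the critical section on that basis would leave far more work than the timeout allows.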

is related to: SERVER-110364 Resharding data replication should not restart cloner when rebuilt after a failover while building indexes (Closed)

Assignee: Brett Nawrocki
Reporter: Brett Nawrocki
Participants: Brett Nawrocki, Githook User
Votes: 0
Watchers: 10

Created: Jul 22 2025 08:38:29 PM UTC
Updated: Jan 26 2026 07:58:15 AM UTC
Resolved: Aug 16 2025 02:46:16 AM UTC
