Majority of secondaries lagging for 30s after a failover, breaching read/write SLA

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9, 7.0.11, 8.0.0
    • Component/s: None
    • Replication
    • ALL
      We have not had much luck reproducing this besides the one instance we saw. However, it does seem like the sequence of events is common enough that it can recur and we want to guard against this.

      We recently observed a case where the primary degraded, resulting in the following sequence of events:

       

      T0:   [All secondaries] Oplog fetcher thread returned error “NetworkTimeout: Error while getting the next batch in the oplog fetcher :: caused by :: recv failed while exhausting cursor :: caused by :: Socket operation timed out”

       

      T0 + 7s: [New Primary]  Election triggers “Starting an election, since we've seen no PRIMARY in election timeout period”

       

      T0 + 7s: [New Primary] Election succeeds, new primary assuming role

       

      T0 + 7s: [All secondaries] Restarting heartbeat after learning of a new primary

       

      T0 + 9s: [All secondaries] Old Primary is now in state RS_DOWN

       

      T0 + 30s: [All secondaries] Blacklisting candidate due to error Request xxx timed out, deadline was T0+30s, op was RemoteCommand xxx – target:[a.b.c.d:27017] db:local expDate: T0+30s cmd:{ find: "oplog.rs", limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, readConcern: {} }

       

      T0 + 30s: [All secondaries] Choosing primary as sync source

       

      T0 + 30s: [All secondaries] Scheduling fetcher to read remote Oplog 

      It therefore took 30s from the time the sync source resolver issued a find command on the oplog against the old primary, which was apparently not responding at this stage, for the request to time out. This timeout is hardcoded here and used here, so it took that long for the secondaries to re-evaluate the sync source, resulting in an equivalent amount of replication lag. This long period of lag is concerning because none of the secondaries can acknowledge writes with a write concern of w:2 or greater, nor serve the latest reads.
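
      For reference, the shell sketch below shows how one can check, during such a window, which node each secondary is using as a sync source and how far behind it is, and how the lag surfaces to clients. It is only a sketch: the test.probe namespace, the w:2 value, and the 5s wtimeout are arbitrary illustration choices, not part of our application.

      // Sketch (mongo/mongosh shell): inspect each member's sync source and lag.
      // Field names (stateStr, syncSourceHost, optimeDate) come from replSetGetStatus.
      var status = db.adminCommand({ replSetGetStatus: 1 });
      var primary = status.members.find(function (m) { return m.stateStr === "PRIMARY"; });
      status.members
        .filter(function (m) { return m.stateStr === "SECONDARY"; })
        .forEach(function (m) {
          var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
          print(m.name + " syncing from " + (m.syncSourceHost || "<none>") + ", lag " + lagSecs + "s");
        });

      // A write that needs two acknowledgements, issued during the 30s window, only
      // returns once a secondary catches up or the (illustrative) 5s wtimeout expires.
      db.getSiblingDB("test").probe.insertOne(
        { ping: new Date() },
        { writeConcern: { w: 2, wtimeout: 5000 } }
      );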

      Based on the above, we have a few questions that we are looking to the community to answer, so that we are in a position to make an appropriate fix for this problem in our setup.

      1. Why is the fetcher timeout on a find command that only does a sort and limits the result to 1 hardcoded to 30s?
      2. We ran the find command in question (db.runCommand({explain: {find: "oplog.rs", sort: {$natural: 1}, projection: {"t": 1, "ts": 1}, limit: 1}}); see the sketch after this list) against an oplog as large as 700G, as well as on the shard involved in the sequence of events above, which had a 50G oplog, and the query took less than a second in both cases.
      3. What are the consequences of lowering the find timeout for the query here in order to reduce the secondary unavailability window for our use case in a scenario like the one described above? What would be a good timeout value to drop to?
      4. Would there be a consequence to reducing the timeout value for the replSetGetRBID call here as well, which uses the same timeout, if we intend for the sync source resolver to fail fast and re-evaluate the sync source?
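
      Regarding question 2, this is roughly how we ran that check, spelled out against the local database where oplog.rs lives; the executionStats verbosity and the manual timing are our additions, and the timings will obviously vary with oplog size and load.

      // Explain the same shape of find that the sync source resolver issues
      // (limit 1, natural-order sort, projecting only ts and t).
      var local = db.getSiblingDB("local");
      local.runCommand({
        explain: { find: "oplog.rs", sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, limit: 1 },
        verbosity: "executionStats"
      });

      // Time the actual query; in our tests it returned in under a second even
      // against a ~700G oplog.
      var start = Date.now();
      local.getCollection("oplog.rs").find({}, { ts: 1, t: 1 }).sort({ $natural: 1 }).limit(1).toArray();
      print("find on oplog.rs took " + (Date.now() - start) + " ms");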

       

      Setup notes

      MongoDB version

      4.4.9, although the same issue can happen in the latest versions due to the same hardcoded fetcher timeout

      OS version

      Ubuntu 24.04.2 LTS 

      Kernel version

      6.8.0-1029-aws

      Shard timeout settings

      chainingAllowed: false, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 5000, catchUpTimeoutMillis: 30000, catchUpTakeoverDelayMillis: 30000
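
      For completeness, this is roughly how those settings map onto the replica set configuration document (a shell sketch; rs.reconfig() is only needed when changing values, and note that none of these settings control the hardcoded 30s fetcher timeout discussed above).

      // Replica set settings as they appear under rs.conf().settings; values mirror ours.
      var cfg = rs.conf();
      cfg.settings.chainingAllowed = false;
      cfg.settings.heartbeatIntervalMillis = 2000;
      cfg.settings.heartbeatTimeoutSecs = 5;
      cfg.settings.electionTimeoutMillis = 5000;
      cfg.settings.catchUpTimeoutMillis = 30000;
      cfg.settings.catchUpTakeoverDelayMillis = 30000;
      rs.reconfig(cfg);   // run against the primary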

       

      Shard layout

      5 voting members in a single cluster

       

            Assignee:
            Unassigned
            Reporter:
            Preeti Murthy
            Votes:
            0
            Watchers:
            3