- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 4.4.9, 7.0.11, 8.0.0
- Component/s: None
- Replication
- ALL
We recently observed a case where the primary degraded, resulting in the following sequence of events:
T0: [All secondaries] Oplog fetcher thread returned error “NetworkTimeout: Error while getting the next batch in the oplog fetcher :: caused by :: recv failed while exhausting cursor :: caused by :: Socket operation timed out”
T0 + 7s: [New Primary] Election triggers “Starting an election, since we've seen no PRIMARY in election timeout period”
T0 + 7s: [New Primary] Election succeeds, new primary assuming role
T0 + 7s: [All secondaries] Restarting heartbeat after learning of a new primary
T0 + 9s: [All secondaries] Old Primary is now in state RS_DOWN
T0 + 30s: [All secondaries] Blacklisting candidate due to error Request xxx timed out, deadline was T0+30s, op was RemoteCommand xxx – target:[a.b.c.d:27017] db:local expDate: T0+30s cmd:{ find: "oplog.rs", limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, readConcern: {} }
T0 + 30s: [All secondaries] Choosing primary as sync source
T0 + 30s: [All secondaries] Scheduling fetcher to read remote Oplog
It therefore took 30s from the time the sync source resolver issued its find command on the oplog against the old primary (which was apparently not responding at this stage) until that command timed out. This timeout is hardcoded here and used here, so it took that long for the secondaries to reevaluate their sync source, resulting in an equivalent amount of replication lag. This long period of lag is concerning because during it none of the secondaries can acknowledge writes with w:2 or greater write concern, nor can they serve the latest reads.
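As an aside, the lag window is easy to observe from the shell while the fetcher is blocked. The sketch below is our own illustration rather than part of the original diagnosis: it assumes a mongo shell session on a secondary and uses the standard replSetGetStatus fields to print the member's chosen sync source and its lag behind the primary.

// Sketch: observe the chosen sync source and replication lag on a secondary.
// Assumption: run from a mongo shell connected to the secondary while a primary is up.
const status = db.adminCommand({ replSetGetStatus: 1 });
const self = status.members.find(m => m.self);
const primary = status.members.find(m => m.stateStr === "PRIMARY");
print("state:       " + self.stateStr);
print("sync source: " + status.syncSourceHost);
if (primary) {
  // optimeDate is a Date; the difference is the lag in milliseconds.
  print("lag (s):     " + (primary.optimeDate - self.optimeDate) / 1000);
}

During the T0 to T0+30s window above, syncSourceHost keeps pointing at the old primary and the lag grows until the fetcher's find times out and the sync source is reevaluated.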
Based on the above, we have a few questions that we hope the community can answer so that we are in a position to make an appropriate fix for this problem in our setup.
- Why is the fetcher timeout on a find command that only sorts and limits the result to 1 hardcoded to 30s?
- We ran the find command in question (db.runCommand({explain: {find: "oplog.rs", sort: {$natural: 1}, projection: {t: 1, ts: 1}, limit: 1}})) against an oplog as large as 700GB, as well as against the 50GB oplog of the shard involved in the sequence of events above, and the query took less than a second in both cases (see the sketch after this list).
- What are the consequences of lowering the find timeout on this query in order to reduce the secondary unavailability window in a scenario like the one described above? What would be a good timeout value to lower it to?
- Would there be any consequence to also reducing the timeout on the replSetGetRBID call here, which uses the same timeout, if we intend for the sync source resolver to fail fast and reevaluate the sync source?
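To make the second bullet concrete, here is a sketch of how the probe can be timed from the shell. The find shape (sort, projection, limit) is the one quoted in the RemoteCommand above; wrapping it in explain with executionStats verbosity is our own illustration for measurement purposes.

// Sketch: time the first-oplog-entry query issued by the sync source resolver.
// The find/sort/projection/limit shape matches the RemoteCommand quoted above;
// the executionStats verbosity is an assumption made here for measurement.
const res = db.getSiblingDB("local").runCommand({
  explain: {
    find: "oplog.rs",
    sort: { $natural: 1 },
    projection: { ts: 1, t: 1 },
    limit: 1
  },
  verbosity: "executionStats"
});
print("execution time (ms): " + res.executionStats.executionTimeMillis);

On both the 700GB oplog and the 50GB oplog mentioned above, this reported under a second, which is what prompted the question about the 30s value.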
Setup notes
- MongoDB version: 4.4.9, although the same issue can happen in the latest versions because the fetcher timeout is still hardcoded
- OS version: Ubuntu 24.04.2 LTS
- Kernel version: 6.8.0-1029-aws
- Shard timeout settings (see the sketch after these notes): chainingAllowed: false, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 5000, catchUpTimeoutMillis: 30000, catchUpTakeoverDelayMillis: 30000
- Shard layout: 5 voting members in a single cluster
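For completeness, a sketch of how the timeout settings above sit in the replica set configuration. The field names and values are the ones listed in the notes; applying them via rs.reconfig() as shown is only illustrative.

// Sketch: the shard's settings as they appear in rs.conf().settings.
// Values are the ones from the setup notes; applying them with rs.reconfig()
// is shown only for illustration.
const cfg = rs.conf();
cfg.settings = Object.assign(cfg.settings || {}, {
  chainingAllowed: false,
  heartbeatIntervalMillis: 2000,
  heartbeatTimeoutSecs: 5,
  electionTimeoutMillis: 5000,
  catchUpTimeoutMillis: 30000,
  catchUpTakeoverDelayMillis: 30000
});
rs.reconfig(cfg);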
is related to:
- SERVER-10225 Replica set failover speed improvement (Closed)