[SERVER-38973] Allow configuration of timeouts for getMores on oplog for replication Created: 14/Jan/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.16
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Dharshan Rangegowda Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-19605 Oplog timeout should be configurable Closed
Assigned Teams:
Replication
Participants:

 Description   

We are running into issues with oplog timeout.

 

2019-01-14T07:05:11.295+0000 I REPL [replication-175] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 1472496 -- target:<xxx>:27017 db:local expDate:2019-01-14T07:05:11.295+0000 cmd:{ getMore: 16260654145, collection: "oplog.rs", maxTimeMS: 5000, term: 37, lastKnownCommittedOpTime: { ts: Timestamp 1547154618000|17, t: 37 } }. Last fetched optime (with hash): { ts: Timestamp 1547364761000|168, t: 37 }[-6073438480613680634]. Restarts remaining: 3
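
To see how often this happens and which command is timing out, the log line above can be picked apart with a small script. The following is only a sketch: the regular expressions are derived from the single sample line in this ticket, and the function name `parse_restart_line` and the returned field names are made up for illustration. Other server versions may format the message differently.

```python
import re

# Patterns inferred from the one sample line in this ticket (an assumption,
# not a documented log format).
RESTART_RE = re.compile(
    r"^(?P<ts>\S+) I REPL\s+\[(?P<thread>[^\]]+)\] "
    r"Restarting oplog query due to error: (?P<error>\w+):"
)
COMMAND_RE = re.compile(r"cmd:\{\s*(?P<name>\w+):")
MAX_TIME_RE = re.compile(r"maxTimeMS:\s*(?P<ms>\d+)")
RESTARTS_RE = re.compile(r"Restarts remaining:\s*(?P<n>\d+)")


def parse_restart_line(line):
    """Return a dict describing an oplog query restart, or None if the
    line is not a 'Restarting oplog query' message."""
    head = RESTART_RE.match(line)
    if head is None:
        return None
    cmd = COMMAND_RE.search(line)
    max_time = MAX_TIME_RE.search(line)
    restarts = RESTARTS_RE.search(line)
    return {
        "timestamp": head.group("ts"),
        "thread": head.group("thread"),
        "error": head.group("error"),
        # Name of the remote command that failed (getMore in the sample).
        "command": cmd.group("name") if cmd else None,
        # Timeout the fetcher attached to that command, in milliseconds.
        "max_time_ms": int(max_time.group("ms")) if max_time else None,
        "restarts_remaining": int(restarts.group("n")) if restarts else None,
    }


# The sample line from this ticket, joined onto one line.
SAMPLE = (
    '2019-01-14T07:05:11.295+0000 I REPL [replication-175] Restarting oplog '
    'query due to error: ExceededTimeLimit: Operation timed out, request was '
    'RemoteCommand 1472496 -- target:<xxx>:27017 db:local '
    'expDate:2019-01-14T07:05:11.295+0000 cmd:{ getMore: 16260654145, '
    'collection: "oplog.rs", maxTimeMS: 5000, term: 37, '
    'lastKnownCommittedOpTime: { ts: Timestamp 1547154618000|17, t: 37 } }. '
    'Last fetched optime (with hash): { ts: Timestamp 1547364761000|168, '
    't: 37 }[-6073438480613680634]. Restarts remaining: 3'
)

if __name__ == "__main__":
    print(parse_restart_line(SAMPLE))
```

For the sample line, the parsed command is `getMore` with a `maxTimeMS` of 5000, which is the timeout this ticket is asking about.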

 

As per the instructions in SERVER-19605 (https://jira.mongodb.org/browse/SERVER-19605), we have set:

setParameter:
   oplogInitialFindMaxSeconds: 600
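
One way to confirm that the setting above was actually picked up is to read it back with the `getParameter` admin command. The sketch below assumes a pymongo-style client object (anything exposing `client.admin.command(...)`) and assumes the server reports this parameter through `getParameter`; the helper name `read_initial_find_timeout` is hypothetical. It only covers the initial find timeout and says nothing about the getMore timeout.

```python
def read_initial_find_timeout(client):
    """Return oplogInitialFindMaxSeconds as reported by the server,
    or None if the reply does not contain it.

    `client` is assumed to behave like pymongo.MongoClient, i.e. it
    exposes `client.admin.command(<command document>)`.
    """
    reply = client.admin.command(
        {"getParameter": 1, "oplogInitialFindMaxSeconds": 1}
    )
    return reply.get("oplogInitialFindMaxSeconds")


# Usage with pymongo (host is a placeholder, not taken from the ticket):
#   from pymongo import MongoClient
#   print(read_initial_find_timeout(MongoClient("mongodb://<host>:27017")))
```

If the value read back is 600 and the getMore in the log still carries `maxTimeMS: 5000`, the two timeouts are evidently controlled separately.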

 

Is there a separate timeout for the oplog getMore command that is not documented?

 



 Comments   
Comment by Dharshan Rangegowda [ 16/Jan/19 ]

Hi Eric,

I think it's better to have the timeouts configurable instead of hardcoded. I don't see this as a broken replica. In one of our scenarios we have a replica in China, and we repeatedly hit this issue.

Comment by Eric Sedor [ 15/Jan/19 ]

Hello,

We can confirm that SERVER-19605 is for the initial find only and not additional getMores. We can consider this a feature request.

That said, because of the implications of changing behavior around oplog getMore timeouts, we would generally recommend that you review the health of the deployment using guidance from the mongodb-user group or Stack Overflow with the mongodb tag. If this timeout needs to be changed, it sounds like there could be serious issues with the deployment that aren't caused by oplog getMores specifically.

Does this make sense?
