[SERVER-27918] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand Created: 05/Feb/17  Updated: 18/Nov/18  Resolved: 21/Apr/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

1. You need a database of approximately 10 GB in a replica set with one primary and no secondaries.
2. Add a new secondary member (a shell sketch follows below).
3. Let the new member initial-sync from the primary.
4. After the initial sync you should see replication lag. Sometimes the lag only appears after 2-3 days.
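
A minimal mongo shell sketch of steps 2-4, with placeholder hostnames and standard helpers; this is an illustration, not the reporter's exact environment:

// Run on the current primary; the hostname is a placeholder.
rs.add("secondary1.example.net:27017")

// Watch the new member: it reports STARTUP2 while initial sync runs,
// then SECONDARY once the sync finishes.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr);
});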

Participants:

Description

Please help me with this very big replication lag.

db.printReplicationInfo()
configured oplog size:   9458.8388671875MB
log length start to end: 1433730secs (398.26hrs)
oplog first event time:  Thu Jan 19 2017 19:25:06 GMT+0000 (UTC)
oplog last event time:   Sun Feb 05 2017 09:40:36 GMT+0000 (UTC)
now:                     Sun Feb 05 2017 09:40:36 GMT+0000 (UTC)
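
For reference, db.printReplicationInfo() above only shows the oplog window of the node it is run on; the per-member lag can be read from the primary with standard helpers. A sketch (no values from this ticket):

// Prints "syncedTo" and "secs behind the primary" for every secondary.
rs.printSlaveReplicationInfo()

// The same information in raw form, from replSetGetStatus.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optimeDate: " + m.optimeDate);
});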

Log from the secondary member. There are very many of these entries:

2017-02-05T10:36:09.129+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to xxx:27017
2017-02-05T10:36:09.131+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to xxx:27017
2017-02-05T10:36:19.725+0000 I REPL     [replication-334] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 2783292 -- target:xxx:27017 db:local expDate:2017-02-05T10:36:19.725+0000 cmd:{ getMore: 13866089092, collection: "oplog.rs", maxTimeMS: 5000, term: 262, lastKnownCommittedOpTime: { ts: Timestamp 1486220253000|22, t: 262 } }. Last fetched optime (with hash): { ts: Timestamp 1486220260000|13, t: 262 }[846203197011982456]. Restarts remaining: 3
2017-02-05T10:36:19.725+0000 I REPL     [replication-334] Scheduled new oplog query Fetcher source: xxx:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1486220260000|13 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 262 } query metadata: { $replData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 10000ms inShutdown: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 2783320 -- target:xxx:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1486220260000|13 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 262 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
...
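
The failing request above is the oplog fetcher's awaitData getMore with maxTimeMS: 5000. A rough, hand-run approximation of the same find/getMore pattern against the sync source (the filter timestamp below is a placeholder, not the one from this log) can help show whether the source itself responds within that window:

var local = db.getSiblingDB("local");

// Mirrors the fetcher's initial query; the filter timestamp is a placeholder.
var res = local.runCommand({
    find: "oplog.rs",
    filter: { ts: { $gte: Timestamp(1486220260, 13) } },
    tailable: true,
    awaitData: true,
    oplogReplay: true,
    maxTimeMS: 60000
});

// Later batches come from getMore with the 5-second await-data timeout seen above.
var more = local.runCommand({
    getMore: res.cursor.id,
    collection: "oplog.rs",
    maxTimeMS: 5000
});
printjson(more.ok ? more.cursor.nextBatch.length : more);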



Comments
Comment by Kelsey Schubert [ 18/Nov/18 ]

For those following along, gen's questions were addressed in SERVER-37820.

Comment by gen [ 29/Oct/18 ]

Hi @Kelsey.schubert

I have to bring this issue up again. I have been facing this problem for a few days: my secondary is always stuck in STARTUP2, and mongod.log shows

"NetworkInterfaceExceededTimeLimit: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback: Operation timed out"

I set up the replica set from a standalone instance following the doc https://docs.mongodb.com/manual/tutorial/convert-standalone-to-replica-set/

The standalone instance had about 10 GB of data before I turned it into the replica set primary. When I add a secondary to this replica set, the secondary always gets stuck in STARTUP2, and the logs show that the sync failed.

I have set the oplog size to 5120 MB; the secondary runs on a different machine. MongoDB v3.6.8.

The attached file is the secondary's [^mongod.log]. If you need any further info I would be glad to provide it.

I was unable to upload mongod.log here, so I uploaded it to https://send.firefox.com/download/a30ef9a89b/#nw2RcwAeeJg9nyR4sOFBcQ
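
A small sketch for inspecting a member stuck in STARTUP2 (commands as they exist in the 3.4/3.6 shells; this is an illustration, not this exact setup):

// Member states; the stuck node reports STARTUP2 until initial sync completes.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr);
});

// Initial sync progress (3.4+); the initialSyncStatus field is version-dependent.
printjson(db.adminCommand({ replSetGetStatus: 1, initialSync: 1 }).initialSyncStatus);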

Comment by Kelsey Schubert [ 27/Jun/17 ]

Hi salzamt,

So we can investigate, would you please open a new ticket with the complete log files and diagnostic.data for each node in the affected replica set?

Thank you,
Thomas

Comment by riccardo salzer [ 27/Jun/17 ]

We are having the same issue: as soon as an old replica set member switches its sync source to the new member, a lot of

Restarting oplog query due to error: ExceededTimeLimit: Operation timed out

entries occur in the logs until it switches back to a member that was initially in the set. For us, the first replication lag warning appeared after 24 hours.
As this was already quite some time ago, let me know if you still need diagnostic data, or perhaps you were already able to reproduce it.
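
Since the behaviour tracks the sync source here, a quick sketch for seeing (and temporarily overriding) which member a node syncs from; the hostname is a placeholder:

// Current sync source of every member.
rs.status().members.forEach(function (m) {
    print(m.name + "  syncingTo: " + (m.syncingTo || "-"));
});

// Temporarily pin this node to a specific sync source.
db.adminCommand({ replSetSyncFrom: "oldMember.example.net:27017" });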

Comment by Kelsey Schubert [ 21/Apr/17 ]

Hi,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thank you,
Thomas

Comment by Kelsey Schubert [ 16/Feb/17 ]

Hi vingrad,

Thanks for reporting this behavior. So we can investigate, would you please provide the diagnostic.data of both the primary and secondary nodes?

ckpeter,

Would you please open a new SERVER ticket and attach diagnostic.data of the primary? It's not expected that the replication lag would increase after upgrading to MongoDB 3.4.2.

Thank you for your help,
Thomas

Comment by Peter Chan [ 15/Feb/17 ]

One more update: simply downgrading the Ubuntu secondaries to 3.2.12, with the replication source still on 3.4.2, did not work. The error message disappeared after the downgrade, but the secondaries on Ubuntu/MongoDB 3.2.12 continued to fall further behind.

It was not until the replication source was also downgraded to 3.2.12 that the lagging secondaries caught up with the replication.

Comment by Peter Chan [ 15/Feb/17 ]

I am seeing a similar error message on my setup, with the following additional information that may be helpful:

  • The DB was previously running fine under 3.2.11. Problem started occurring after upgrade to 3.4.2.
  • The problem seems to be occurring only on Linux / Ubuntu 16.04. A different secondary, running the same mongo version on Windows 2008 R2, does not have the error.
  • The problematic secondary on Ubuntu is currently 7.59 hours behind the primary. A different secondary instance on the same Ubuntu host, running on a different port and configured with a slave delay of 24 hours (see the sketch below), is currently 116 hours behind the primary.
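
A short sketch for confirming the delayed member's configuration alongside the lag figures above (field names as in the 3.2/3.4 replica set configuration document):

// slaveDelay is in seconds; the 24-hour delayed member should show 86400.
rs.conf().members.forEach(function (m) {
    print(m.host + "  slaveDelay: " + m.slaveDelay + "  hidden: " + m.hidden);
});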