[SERVER-19605] Oplog timeout should be configurable Created: 27/Jul/15  Updated: 06/Mar/19  Resolved: 13/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.7, 3.0.4
Fix Version/s: 3.4.11, 3.6.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Travis Redman Assignee: Judah Schvimer
Resolution: Done Votes: 9
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-10789 Docs for SERVER-19605: Oplog timeout ... Closed
Duplicate
is duplicated by SERVER-27952 Replication fails under heavy load - ... Closed
Related
related to SERVER-28005 Oplog query network timeout is less t... Closed
related to SERVER-38973 Allow configuration of timeouts for g... Backlog
is related to SERVER-6733 Make oplog timeout shorter Closed
is related to SERVER-26106 Raise oplog socket timeout for rollback Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: Repl 2017-10-02
Participants:
Case:
Linked BF Score: 0

 Description   

We just encountered a situation where all secondaries in two of our replica sets had ceased replication and were 1-2 days behind the primary. This appears to have been caused in part by the fact that the initial oplog query from SECONDARY to PRIMARY times out after 30 seconds, while the oplog query takes > 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.

As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and so our secondaries have a chance to catch up.

Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.
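
(For illustration only: if such a knob were exposed as a server parameter, it could presumably be set at startup or at runtime roughly as sketched below. The parameter name oplogInitialFindMaxSeconds, the value used, and whether it is settable at runtime are assumptions on my part, not something stated in this ticket; check the documentation for the release that ships the fix, e.g. 3.4.11 / 3.6 per the fix versions above.)

// Hypothetical usage sketch; the parameter name is an assumption.
// At mongod startup on the syncing secondary:
//   mongod --replSet rs0 --setParameter oplogInitialFindMaxSeconds=600
// Or at runtime from the mongo shell, if the parameter is runtime-settable:
db.adminCommand({ setParameter: 1, oplogInitialFindMaxSeconds: 600 })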

We have a fairly large oplog:

> db.printReplicationInfo()
configured oplog size:   143477.3826171875MB
log length start to end: 1620689secs (450.19hrs)
oplog first event time:  Wed Jul 08 2015 23:11:24 GMT+0000 (UTC)
oplog last event time:   Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
now:                     Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)

Here are some sample queries issued by the secondaries that are timing out:

Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms
Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 reslen:25310 1362020ms



 Comments   
Comment by Githook User [ 27/Oct/17 ]

Author:

{'email': 'judah@mongodb.com', 'name': 'Judah Schvimer', 'username': 'judahschvimer'}

Message: SERVER-19605 make oplog timeout configurable

(cherry picked from commit 5dbccb4a861aa2db993dd673097a1300bcdc9cca)
Branch: v3.4
https://github.com/mongodb/mongo/commit/779bccf9dbd79aade8160a2d0e33c249ce38e286

Comment by Ramon Fernandez Marina [ 13/Sep/17 ]

Author:

{'email': 'judah@mongodb.com', 'name': 'Judah Schvimer', 'username': 'judahschvimer'}

Message: SERVER-19605 make oplog timeout configurable
Branch: master
https://github.com/mongodb/mongo/commit/5dbccb4a861aa2db993dd673097a1300bcdc9cca

Comment by Lucas [ 31/Aug/17 ]

I'm having this same issue with two different replica sets (not sharded clusters) here.

One of them is now in version 3.4.4 and the other one in 3.4.7 (we updated trying to "solve" this problem). The composition of this set is one primary, one secondary and one arbiter.

One of them handles 7k queries + 7k inserts per second (fully SSD + 10Gbps network);
the other one handles 5k commands/s, 700 updates/s, and 41 inserts/s (fully SSD + 1Gbps network).

Firstly we thought the number of connections was the cause, because every time our primary reached 5k connections the replication lag began to increase (this was happening in both of those replica sets).

We changed our application pools, and now even with 3k connections this problem still occurs. We have used one of these replica sets for more than 3 years, and we never had this problem before. Maybe the update to 3.4.4+ is the cause?

We also tried to change many configs (like connections per host, incoming connections, etc.); nothing changed.

One thing we saw in our logging is the query being made with a 60-second timeout, but the error logging a 5-second timeout:

2017-08-31T13:46:28.825+0000 I REPL [replication-0] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 25693 – target:10.15.187.6:7308 db:local expDate:2017-08-31T13:46:28.825+0000 cmd:{ getMore: 19638055769, collection: "oplog.rs", maxTimeMS: 5000, term: 10, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } }. Last fetched optime (with hash): { ts: Timestamp 1504129528000|245, t: 8 }[6958771872602233271]. Restarts remaining: 3

2017-08-31T13:46:28.825+0000 I REPL [replication-0] Scheduled new oplog query Fetcher source: 10.15.187.6:7308 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 25853 – target:10.15.187.6:7308 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

I think this may just be a logging issue, since this ExceededTimeLimit occurs once per minute.

I don't know how to prevent this replication lag, but I think this isn't only a replication issue; it is an issue involving primary performance. We are 100% sure that our primary can tolerate our load (3 years to prove it!!).

We can't easily downgrade our replica sets to older versions, and we need at least some kind of workaround. I tried to downgrade one of those secondaries to 3.4.3 and replication still times out. We can't downgrade the primary since it is being used in production.

Thanks.

Comment by Ramon Fernandez Marina [ 26/Jul/17 ]

Thanks for the update michaelbrenden, we're really glad to hear 3.4.3 is working well for you.

Comment by Michael Brenden [ 25/Jul/17 ]

UPDATE 2017-07-25

Nightly, we have been flogging 3.4.3 with millions of rows imported at once / as quickly as possible.
This has been causing major backups in the oplog / queue, working the primary hard and consequently working the secondaries hard.
Since the upgrade to 3.4.3 and its fix, we have not seen greater than 5 minute lag in any secondary.
In other words, everything 'catches up' within 5 minutes, despite flood of inserts.
At all other times during normal operation, secondaries never fall more than 10 seconds behind.
Kudos, team Mongo.

Comment by Spencer Brody (Inactive) [ 15/May/17 ]

In earlier versions of MongoDB we were setting the network timeout for the initial oplog query to 10 seconds, instead of 60 seconds as we had actually intended. This was a bug and was fixed in SERVER-28005, which is included in 3.2.13+ and 3.4.3+. Due to this issue, I am not surprised, rsi-holtkamp and michaelbrenden, that upgrading to 3.4.3 helped. I hope that fix continues to work well for you - please let us know if you have any further issues in this area.

rromano, this sounds like something else is going on. There are 2 main reasons the oplog query could take a long time: the replicating secondary is severely lagged behind its sync source, or the sync source is overloaded. This ticket has been left open to address the first issue - where a secondary is severely lagged but still has a common point between its oplog and its sync source's, yet finding that common point can take a long time (> 60 seconds). The second issue, as I have said, should be resolved by fixing the performance problems on the sync source. In your case, however, you are saying that there's no repl lag and the sync source is not loaded. In that case I would not expect the oplog query to ever take longer than 60 seconds; the overall size of the oplogs on either node should not have any effect on the performance of the oplog query. So it sounds like something else must be going on that's causing the issues you are seeing. I would recommend engaging with our customer support staff or posting on the mongodb-user group for assistance.

-Spencer

Comment by Michael Brenden [ 15/May/17 ]

The issue we were seeing was not due to an "overloaded primary".

The upgrade from 3.4.2 to 3.4.3 fixed the issue for us (on one secondary; we haven't tested the other secondary yet, which had to be decommissioned because of this issue), but the upgrade apparently shouldn't have fixed it.

Comment by Robert Romano [ 05/May/17 ]

I'm seeing this same issue when doing a rolling upgrade of a 3-node replica set. There is a large oplog, but there wasn't much lag and it was a quiet period of little to no inserts or queries when we stepped down our primary and upgraded it to the latest version of MongoDB (3.4.4). Then, when restarting, it entered RECOVERING and kept saying "Blacklisting XXXX:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand'". I don't think the primary was overloaded - it just became the primary. Same for the other secondary. So it seems that if you have a large oplog (just because you need it for other reasons) you can hit this longer query time, and therefore it's not a given that you are overloaded.

Comment by Daniel Rupp [ 03/Apr/17 ]

I'd like to point out another cause of this problem, which is what we are experiencing:

Slow network and big changes.

We have a MongoDB cluster in Hong Kong and a hidden slave in our office in Europe. The bandwidth is very limited; fortunately the server does not have a very high change rate (it is mostly an archive system), but changes can be very big. At the moment we are stuck in a loop where the replication client tries to download the next oplog entry but hits the timeout before it is downloaded completely. It cancels the transfer and starts again. Every minute, never completing. Again and again...

Hope this gives you another view on this issue.

Edit: 3.4.3 did fix our problem.

Comment by Spencer Brody (Inactive) [ 01/Mar/17 ]

Hi,
Let me start by saying I'm sorry that you've been having problems keeping your secondaries in sync, I understand how important that is to maintaining availability for your database. While investigating this issue, I discovered a bug where even though the initial operation timeout for establishing the cursor was 60 seconds, the network layer was actually timing out the request after only 10 seconds. I fixed that bug last week in SERVER-28005, which will be backported shortly and made available in 3.4.3 and 3.2.13. Hopefully this should alleviate the problems some of you have been encountering.

I have also reviewed all the comments on this ticket, and seeing how many people have been affected by this, I have become convinced that it makes sense to expose this as a configuration option. I still agree with Thomas that 99% of the time if an oplog cursor cannot be established within 60 seconds, that indicates that the sync source is likely overloaded or experiencing other problems, and that instead of just increasing the timeout on the oplog query it would be better to address the root cause of the slowdown on the sync source. That said, however, I can imagine a legitimate case for wanting to increase the timeout, namely if the syncing node is severely repl lagged, where finding the common point with its sync source's oplog could legitimately take a long time.

Therefore, I am marking this ticket as "3.5 Required", meaning that we will address it at some point during our 3.5 development cycle. I am also marking it for consideration for backport to 3.4 - once the change has been made to the master branch, we will see how involved the change was and consider whether it can be safely included in the next stable release of 3.4.

I hope this addresses your concerns. If you have further questions or concerns about this please let me know. Also please feel free to watch this ticket for future updates and to be notified when it has been released into a stable version of MongoDB.

-Spencer

Comment by Michael Brenden [ 24/Feb/17 ]

Echoing: The problem does not seem to be with an 'overloaded primary'. It is specific to certain secondaries, while other secondaries operate perfectly well.

We can cause this problem to happen on-demand, so if 10gen needs any testbed, feedback, etc., just say so.

Comment by Jonathan Kamens [ 24/Feb/17 ]

> "Please be aware this log message is typically symptomatic of an overloaded primary. Therefore, while increasing the oplog timeout may prevent these messages from being logged, it would likely not resolve the replication lag that is being observed."

That is not at all our experience.

In all instances when we have had this problem repeatedly on an ongoing basis, recompiling mongod with a longer oplog query timeout made the problem go away full stop, and allowed the secondaries to successfully replicate from the primary sufficiently fast to meet our needs.

> "We are currently investigating options to modify the oplog timeout."

Oh, for Pete's sake, just make the damn thing a configuration setting so that people who are having this problem can actually do something about it, rather than being forced to wait years for any sort of solution from MongoDB.

If there are indeed other underlying causes of this issue that need to be addressed, then great, address them, but in the meantime, give people what they need to treat the symptoms so their secondaries don't keep breaking. This is so obvious that it's simply mind-blowing you haven't done this despite the fact that this issue has been open for almost two years. Sheesh.

Comment by Kelsey Schubert [ 21/Feb/17 ]

Thanks for your interest in this improvement. We are currently investigating options to modify the oplog timeout.

Please be aware that this log message is typically symptomatic of an overloaded primary. Therefore, while increasing the oplog timeout may prevent these messages from being logged, it would likely not resolve the replication lag that is being observed. For MongoDB-related support discussion, I would recommend posting on the mongodb-user group or Stack Overflow with the mongodb tag. A question about how to resolve replication lag involves more discussion and would be best posted on the mongodb-user group.

Comment by Michael Brenden [ 08/Feb/17 ]

I pasted a few of the reports/comments above into a new ticket: https://jira.mongodb.org/browse/SERVER-27952

Comment by Michael Brenden [ 08/Feb/17 ]

Shut down all failing secondaries.
Flushed one "weak" secondary and did full resync.
On weak secondary: Moved all data indexes to another drive. Also moved local and journal dirs to another drive.
Kept "strong" (non-failing) secondaries running.
printSlaveReplicationInfo() on PRIMARY shows this "weak" secondary and all strong secondaries are 0 seconds behind PRIMARY.
Small operations work fine (del, updt, insert), hundreds or thousands, over hours, no problem.
So, everything looks fine so far.

Did a truncate on a collection having over 100k documents, then immediately started a mass document insert (batched 100 documents at a time), probably inserting maybe 500 to 1k documents per second on the PRIMARY. This is when the trouble always begins.
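
(For anyone trying to reproduce this, a rough mongo shell sketch of that workload; the collection name, document shape, and total count are illustrative, not taken from the report:)

// Rough reproduction sketch run against the PRIMARY; names and sizes are made up.
db.stress.drop();                                  // "truncate" the collection
var payload = new Array(201).join("x");            // ~200-byte filler string
var batch = [];
for (var i = 0; i < 2000000; i++) {                // up to ~2 million documents
    batch.push({ _id: i, data: payload });
    if (batch.length === 100) {                    // batched 100 documents at a time
        db.stress.insertMany(batch);
        batch = [];
    }
}
if (batch.length > 0) db.stress.insertMany(batch);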

For the first few seconds after the start of the mass inserts (and during the course of the inserts, which take several minutes), the strong and weak secondaries fall behind by no more than 15 seconds and sporadically catch up. However, after about 30 seconds the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); the strong secondary keeps falling behind by 15 seconds or so and then catching up, over and over, and ends up fully caught up after the mass insert of up to 2 million documents. CPU load on the PRIMARY (32 cores) never really changes during this entire time; CPU load on the strong secondaries also never changes much; but CPU load on the weak secondary goes to 4 or higher (4 cores), then settles down after the mass insert on the PRIMARY ends and as failures continue to log on the weak secondary.

Errors logged on the weak secondary (repeats every 10 seconds):

[replication-2] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 13296 – target:10.11.22.33:27017 db:local expDate:2017-02-08T01:25:02.627-0500 cmd:{ getMore: 15975151664, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1486464780000|35813, t: 7 } }. Last fetched optime (with hash): { ts: Timestamp 1486533979000|316, t: 7 }[-4870081738431509318]. Restarts remaining: 3

Comment by Vladimir [ 07/Feb/17 ]

@Travis Redman, can you please move this into the bug section to attract more attention?

Comment by Michael Brenden [ 07/Feb/17 ]

Found this mod to the timeout:
https://github.com/mongodb/mongo/commit/5a3244c1b1424f4fbeb006f501c450074f4127f2

Comment by Michael Brenden [ 07/Feb/17 ]

Yes, that gets my vote – because if this has been going on since 2.6, and if it does routinely stop replication (as it does for us), it would seem a rather huge bug.

Just tried moving (and ln -s'ing) the oplog from the standard mongo data dir to another drive on a troublesome secondary and noticed it was able to keep up for a longer time period – its replication did not fail right away within the first few seconds but instead was able to keep up (i.e., replicate properly, with a low ~5 to ~10 second lag) for maybe a minute or so before getting swamped and timing out. The secondary never recovers after the timeout. Trying next to move all collection indexes to another drive.

Previously, we blew out the whole replica set and tried a 2 GB oplog on the primary. It made no difference, so we blew it all out again and changed back to a 20 GB oplog (which we use presently). Overall, I have to say, this feels very microsoft-y, like a throwback to the agonizing NT4 way of troubleshooting: change, reboot, change, reboot.

Comment by Vladimir [ 07/Feb/17 ]

Maybe we should rename this issue from "Oplog timeout should be configurable" to "Replication bug under huge load"? And change the category from "Improvement" to "Bug"?

Comment by Jonathan Kamens [ 07/Feb/17 ]

>"Problem seems related to upgrade from 3.2 to 3.4"

Nah. We see it in 2.6. It's been around for far longer than 3.2.

Comment by Michael Brenden [ 07/Feb/17 ]

Not sure. The problem happens whenever the PRIMARY is given several thousand inserts over a few-second period. Some secondaries pull everything fine and keep up fine, staying up to date within a few seconds. Other secondaries immediately begin to fall behind and never catch up. The problem seems related to the upgrade from 3.2 to 3.4.

Comment by Eric Milkie [ 06/Feb/17 ]

It's important to determine why the getMore cursor advancement on the oplog is taking over five seconds on the upstream server, as this should be nearly instantaneous. An analysis of the locking data there should take place.

Comment by Michael Brenden [ 06/Feb/17 ]

On the primary and some secondaries, all of which have their mongo indexes stored on SSDs, the oplog timeout is not occurring; all other secondaries (with mongo indexes stored on spinning media) exhibit the oplog timeout. Not sure why it would matter; just an observation.

Comment by Vladimir [ 05/Feb/17 ]

OK, I use Docker in swarm mode with a multi-host encrypted network (the standard setup in this Docker case). I have absolutely the same error.

Comment by Michael Brenden [ 05/Feb/17 ]

Previously, we discovered too-short timeouts in our TCP socket settings that we thought were causing the problem. We resolved the TCP socket timeout issues, but we are still having this problem, although now it is only on some subnets, some of which are routed using OpenVPN. Upgraded to 3.4.2 with no change / improvement. Still researching. Will report back.

Every ten seconds, this kind of error is written to the log on two (of several) secondaries:

[replication-10] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 714235 -- target:10.11.12.13:27017 db:local expDate:2017-02-05T07:13:19.734-0500 cmd:{ getMore: 15303167572, collection: "oplog.rs", maxTimeMS: 5000, term: 5, lastKnownCommittedOpTime: { ts: Timestamp 1486296764000|16608, t: 5 } }. Last fetched optime (with hash): { ts: Timestamp 1486296764000|16708, t: 5 }[-8770863745196970018]. Restarts remaining: 3
 
[replication-10] Scheduled new oplog query Fetcher source: 10.11.12.13:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1486296764000|16708 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 5 } query metadata: { $replData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 714323 -- target:10.11.12.13:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1486296764000|16708 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 5 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

These two secondaries are connected via openvpn tunnel. Several other secondaries, some connected similarly via openvpn and some not, work perfectly well. All machines run 3.4.2.

These errors occur while the secondaries are applying bulk inserts made to the primary... or any inserts numbering in the several hundreds of documents.

Comment by Vladimir [ 05/Feb/17 ]

I have this problem with the version 3.4.1 too:

From the log on the secondary - there are very many of these entries:

[replication-334] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 2769905 -- target:.....
...
cmd:{ getMore: 16761306662, collection: "oplog.rs", maxTimeMS: 5000, term: 262, lastKnownCommittedOpTime: { ts: Timestamp 1486218283000|47, t: 262 } }. Last fetched optime (with hash): { ts: Timestamp 1486218285000|49, t: 262 }[3084471424447709631]. Restarts remaining: 3

db.printReplicationInfo()
configured oplog size:   9458.8388671875MB
log length start to end: 1433730secs (398.26hrs)
oplog first event time:  Thu Jan 19 2017 19:25:06 GMT+0000 (UTC)
oplog last event time:   Sun Feb 05 2017 09:40:36 GMT+0000 (UTC)
now:                     Sun Feb 05 2017 09:40:36 GMT+0000 (UTC)

Comment by Michael Brenden [ 04/Dec/16 ]

No, I meant 6 seconds.
Maybe the problem is somehow not due to the 60-second timeout, then.
One possibility is a tcp socket timeout that's too brief.
We're still moving stuff around, trying different oplog sizes, demoting / promoting replicaset secondaries to test variously sized oplogs, etc.

Onesy-twosy adds/updates to the master (and thus brief, small oplog entries) cause no problems. Duplicated a 6-row collection, no problem; duplicated a 300-row collection, no problem; after trying to duplicate a 71,000-row collection, the logs show something about index type 3.4.0, and all secondaries get confused to the point of halting all replication. Investigating this now.

Comment by Eric Milkie [ 04/Dec/16 ]

Did you mean to say a 60-second timeout? That is the current value in version 3.4.0.

Comment by Michael Brenden [ 03/Dec/16 ]

We are hitting this same issue in production, on mongo 3.4.0,

while inserting up to 1 million records, batched at 10k records at a time.

The oplog apparently grows to the point that the secondaries hit their pitiful 6-second timeout and abort.

Why isn't this option configurable??

Comment by Ven [ 07/Jul/16 ]

Hi Ramon,

Will see if we can get the security approval and provide some logs. Some additional color for recreation:

A/ Our "ingestion" in to the cluster includes inserts, updates AND deletes (yes, we are working to rework this part and not do a search and delete).
B/ When the sum of records (or documents) added/changed/deleted crosses around 370K/min, it alters the getmore responses so much so that it jumps past 30 seconds.
C/ In the most recent event, with inserts and updates running around 150K/min, a delete operation deleted about 650K records in 132 seconds. That was enough to push getmore response over the 30 second limit.

In the past, sudden surges of inserts have also triggered similar behavior. Once this happened, with deletes continuing for another 10-15 minutes (we have a monitoring script that stops deletes if replication lag increases beyond a threshold), it was enough for full oplog scan queries to take a long time, and from there everything went downhill even though all the deletes had stopped and only inserts were happening at a constant clip.
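
(For what it's worth, a minimal mongo shell sketch of the kind of lag check such a monitoring script might perform; the 60-second threshold and the use of rs.status() optimeDate fields are assumptions, not Ven's actual script:)

// Minimal lag-check sketch, not the actual monitoring script described above.
var status = rs.status();
var primaryOptime = null;
status.members.forEach(function (m) {
    if (m.state === 1) primaryOptime = m.optimeDate;      // PRIMARY
});
var maxLagSecs = 0;
status.members.forEach(function (m) {
    if (m.state === 2) {                                   // SECONDARY
        var lagSecs = (primaryOptime - m.optimeDate) / 1000;
        if (lagSecs > maxLagSecs) maxLagSecs = lagSecs;
    }
});
if (maxLagSecs > 60) {
    print("replication lag " + maxLagSecs + "s exceeds threshold: pause deletes");
}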

Comment by Ramon Fernandez Marina [ 06/Jul/16 ]

ven, sorry you're also running into this. Do you have logs you can share from the primary and at least one secondary node during the high ingestion process (steps 2-5 above)? You can attach them to this ticket or upload them privately here.

We were not able to reproduce this behavior when it was first reported, but hopefully your logs will shed some light on this ticket.

Thanks,
Ramón.

Comment by Ven [ 06/Jul/16 ]

Sorry first time here...hopefully the following is relevant and helpful. Please excuse if not the right place.

We have been hitting this problem continually for the last several months. Our cluster is currently 1+3 on 2.6.12, on the biggest, baddest AWS IO-optimized instances. Our data size is about 3 TB per cluster, ingesting about 220-250K records/min; average record size is 500 bytes, with a range of 200-1000 bytes.

As Jonathan mentions above, ours is also a pretty busy cluster with high ingestion rates and our oplog size matches that. What we find is the following:

1/ Cluster is happy and replicating with lag < 5 seconds.
2/ Sudden high ingestion happens - that is, about 1.1-1.5x the ingestion rate of the steady happy state.
3/ getmore requests are delayed and jump to above 30 seconds. This causes the secondaries to hit the read timeout and start afresh.
4/ Starting afresh unfortunately seems to cause a complete oplog scan on the primary, which takes on the order of minutes given the ingestion rate. This query also times out on the secondaries in 30 seconds.
5/ Each secondary repeatedly times out and issues a new query every 30 seconds. The primary accumulates all the queries and gets slower and slower.

Now unless the ingestion is totally stopped, there is no way to recover. We ended up with a huge number of queries pending on the primary - we had to stop the entire ingestion pipeline and kill all the long-running oplog queries, and then things started back up - but the cluster is 18 hours behind.

Observations/Questions:
1/ The recovery step a secondary takes when a getmore request fails seems to cause even more load on the primary. There have to be some design choices to make here to handle these errors more gracefully.
2/ Full oplog scan queries were increasing in duration forever in the above state (when ingestion was still happening). Is there some tuning that needs to be done to service these queries faster, since they seem to be critical for a secondary to get started?
3/ Is there any reason why the oplog queries from secondary to primary don't have a query-level timeout (like using maxTimeMS in 2.6.x) when the underlying connection read timeout is 30 seconds?

In the meantime we will follow Jonathan's approach of bumping the hard-coded timeout to get more mileage.

Comment by Jonathan Kamens [ 17/Mar/16 ]

This just bit us in one of our production replica sets. The primary was sufficiently busy, our oplog sufficiently large, and we inserted enough documents in a short enough period of time that scanning the oplog to respond to the secondary started taking not just more than 30 seconds, but upwards of 200 seconds. Our secondary stopped sync'ing because of the receive timeouts and stayed that way until I compiled a custom mongod binary with tcp_timeout set to 600 seconds in oplogreader.h and ran that binary on the secondary, after which it happily started synchronizing with the primary again.

Comment by Travis Redman [ 11/Aug/15 ]

I believe it was one of the rocks nodes. Next time I see this pop up I'll grab db.currentOp() for each client IP.

Comment by Asya Kamsky [ 06/Aug/15 ]

tredman@fb.com which secondary is the one at that IP?

Comment by Travis Redman [ 06/Aug/15 ]

db.currentOp() showing how the oplog queries pile up from the same client (I filtered out the exact IP). The client disconnects, but the query keeps running. It's unclear to me whether or not this pileup has a significant effect on the performance of subsequent queries, but I do note a lot of yielding and waiting on the lock.

PRIMARY> db.currentOp({ns: "local.oplog.rs", client: { $regex: /10.X.X.X.+/ }}).inprog
[
        {
                "opid" : -178347103,
                "active" : true,
                "secs_running" : 40,
                "microsecs_running" : NumberLong(40792588),
                "op" : "query",
                "ns" : "local.oplog.rs",
                "query" : {
                        "ts" : {
                                "$gte" : Timestamp(1438829050, 1219)
                        }
                },
                "client" : "10.X.X.X:38733",
                "desc" : "conn21511382",
                "threadId" : "0x7e3c90e64700",
                "connectionId" : 21511382,
                "locks" : {
                        "^" : "r",
                        "^local" : "R"
                },
                "waitingForLock" : true,
                "numYields" : 4217,
                "lockStats" : {
                        "timeLockedMicros" : {
                                "r" : NumberLong(15240757),
                                "w" : NumberLong(0)
                        },
                        "timeAcquiringMicros" : {
                                "r" : NumberLong(32385598),
                                "w" : NumberLong(0)
                        }
                }
        },
        {
                "opid" : -178566224,
                "active" : true,
                "secs_running" : 113,
                "microsecs_running" : NumberLong(113522052),
                "op" : "query",
                "ns" : "local.oplog.rs",
                "query" : {
                        "ts" : {
                                "$gte" : Timestamp(1438829050, 1219)
                        }
                },
                "client" : "10.X.X.X:38663",
                "desc" : "conn21509489",
                "threadId" : "0x7e3c6ed45700",
                "connectionId" : 21509489,
                "locks" : {
                        "^" : "r",
                        "^local" : "R"
                },
                "waitingForLock" : true,
                "numYields" : 13928,
                "lockStats" : {
                        "timeLockedMicros" : {
                                "r" : NumberLong(39563069),
                                "w" : NumberLong(0)
                        },
                        "timeAcquiringMicros" : {
                                "r" : NumberLong(91206970),
                                "w" : NumberLong(0)
                        }
                }
        },
        {
                "opid" : -178671802,
                "active" : true,
                "secs_running" : 149,
                "microsecs_running" : NumberLong(149248148),
                "op" : "query",
                "ns" : "local.oplog.rs",
                "query" : {
                        "ts" : {
                                "$gte" : Timestamp(1438829050, 1219)
                        }
                },
                "client" : "10.X.X.X:38622",
                "desc" : "conn21509440",
                "threadId" : "0x7eba854aa700",
                "connectionId" : 21509440,
                "locks" : {
                        "^" : "r",
                        "^local" : "R"
                },
                "waitingForLock" : true,
                "numYields" : 19665,
                "lockStats" : {
                        "timeLockedMicros" : {
                                "r" : NumberLong(53609901),
                                "w" : NumberLong(0)
                        },
                        "timeAcquiringMicros" : {
                                "r" : NumberLong(118860266),
                                "w" : NumberLong(0)
                        }
                }
        }
]
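
(As a cleanup step, a minimal shell sketch of killing piled-up oplog queries like the ones above, along the lines of the "kill all the long running oplog queries" step mentioned in Ven's comment above; the 120-second cutoff is an arbitrary example:)

// Kill long-running oplog queries left behind by disconnected clients.
// The secs_running cutoff (120s) is an example value only.
db.currentOp({ ns: "local.oplog.rs", op: "query", secs_running: { $gt: 120 } })
  .inprog.forEach(function (op) {
      print("killing opid " + op.opid + " from " + op.client);
      db.killOp(op.opid);
  });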

Comment by Travis Redman [ 06/Aug/15 ]

It's happening again, actually. We have a 7-node cluster (+ 2 arbiters): 4 nodes are 2.6.7, including the primary, and three hidden nodes are running a build based on 3.0.4 with mongo+rocksdb. Four nodes (one 2.6.7 MMAP node and three 3.0.4 rocks nodes) are behind by 13 hours. This is mongodata9 if you can see our MMS data. Seeing similarly slow oplog queries:

Thu Aug  6 16:35:33.384 [conn21503321] query local.oplog.rs query: { ts: { $gte: Timestamp 1438829050000|1219 } } cursorid:2872839288050359 ntoreturn:0 ntoskip:0 nscanned:3045214 nscannedObjects:3045214 keyUpdates:0 numYields:25463 locks(micros) r:49719642 nreturned:101 reslen:16960 210648ms
Thu Aug  6 16:36:27.059 [conn21503302] query local.oplog.rs query: { ts: { $gte: Timestamp 1438830597000|138 } } cursorid:2872840440281129 ntoreturn:0 ntoskip:0 nscanned:3737712 nscannedObjects:3737712 keyUpdates:0 numYields:31568 locks(micros) r:57307279 nreturned:101 reslen:83219 269609ms
Thu Aug  6 16:36:28.770 [conn21503348] query local.oplog.rs query: { ts: { $gte: Timestamp 1438829042000|121 } } cursorid:2872840792306064 ntoreturn:0 ntoskip:0 nscanned:3039279 nscannedObjects:3039279 keyUpdates:0 numYields:24930 locks(micros) r:51984487 nreturned:101 reslen:33013 252525ms

Here's build info from the primary.

PRIMARY> db.serverBuildInfo()
{
        "version" : "2.6.7",
        "gitVersion" : "a7d57ad27c382de82e9cb93bf983a80fd9ac9899",
        "OpenSSLVersion" : "",
        "sysInfo" : "Linux build7.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49",
        "loaderFlags" : "-fPIC -pthread -Wl,-z,now -rdynamic",
        "compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -pipe -Werror -O3 -Wno-unused-function -Wno-deprecated-declarations -fno-builtin-memcmp",
        "allocator" : "tcmalloc",
        "versionArray" : [
                2,
                6,
                7,
                0
        ],
        "javascriptEngine" : "V8",
        "bits" : 64,
        "debug" : false,
        "maxBsonObjectSize" : 16777216,
        "ok" : 1
}

Comment by Asya Kamsky [ 06/Aug/15 ]

tredman@fb.com can you please include db.serverBuildInfo() from the primary that this happened with?

We had an engineer try to reproduce the long oplog query times but they weren't able to, and I want to make sure that they are trying with a similar enough setup.

Comment by Eric Milkie [ 05/Aug/15 ]

I think I would want to first understand why such queries on the oplog are taking over one thousand seconds.

Generated at Thu Feb 08 03:51:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.