[SERVER-19605] Oplog timeout should be configurable Created: 27/Jul/15 Updated: 06/Mar/19 Resolved: 13/Sep/17
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.6.7, 3.0.4 |
| Fix Version/s: | 3.4.11, 3.6.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Travis Redman | Assignee: | Judah Schvimer |
| Resolution: | Done | Votes: | 9 |
| Labels: | neweng | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v3.4 |
| Sprint: | Repl 2017-10-02 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
|
We just encountered a situation where all secondaries in two of our replica sets had ceased replication and were 1-2 days behind the primary. This appears to have been caused in part by the fact that the initial oplog query from SECONDARY->PRIMARY times out after 30 seconds, while the oplog query itself takes more than 5 minutes to run. Some searching led me to this JIRA.

As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and our secondaries have a chance to catch up. Ideally, this value would be configurable with a flag or configuration option, to avoid the need to recompile and to allow users to tune the timeout for their particular situation.

We have a fairly large oplog:
Here are some sample queries issued by the secondaries that are timing out:
|
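The fix that eventually shipped (Fix Version/s 3.4.11 / 3.6.0-rc0) exposes this timeout as a server parameter. A minimal usage sketch, assuming the parameter name oplogInitialFindMaxSeconds and that it can be set at runtime as well as at startup:

```
// Raise the initial oplog find timeout on a lagging secondary (value in seconds).
// Assumes the oplogInitialFindMaxSeconds server parameter from the eventual fix;
// the startup equivalent would be: mongod --setParameter oplogInitialFindMaxSeconds=300
db.adminCommand({ setParameter: 1, oplogInitialFindMaxSeconds: 300 });

// Read back the current value to confirm it took effect.
db.adminCommand({ getParameter: 1, oplogInitialFindMaxSeconds: 1 });
```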
| Comments |
| Comment by Githook User [ 27/Oct/17 ] |
|
Author: Judah Schvimer <judah@mongodb.com> (judahschvimer). Message: (cherry picked from commit 5dbccb4a861aa2db993dd673097a1300bcdc9cca) |
| Comment by Ramon Fernandez Marina [ 13/Sep/17 ] |
|
Author: Judah Schvimer <judah@mongodb.com> (judahschvimer). Message: |
| Comment by Lucas [ 31/Aug/17 ] |
|
I'm having this same issue with two different replica sets (not sharded clusters). One of them is now on version 3.4.4 and the other on 3.4.7 (we upgraded trying to "solve" this problem). The composition of each set is one primary, one secondary and one arbiter. One of them handles 7k queries + 7k inserts per second (fully SSD + 10 Gbps network).

At first we thought the number of connections was the cause, because every time our primary reached 5k connections the replication lag began to increase (this happened in both replica sets). We changed our application pools, and now even with 3k connections the problem still occurs. We have used one of these replica sets for more than 3 years and never had this problem before. Maybe the upgrade to 3.4.4+ is the cause? We also tried changing many settings (connections per host, incoming connections, etc.), but nothing changed. One thing we see in our logging is the query being made with a 60-second timeout but failing with an error that reports a 5-second timeout:

I think this may just be a logging issue, since the ExceededTimeLimit occurs once per minute. I don't know how to prevent this replication lag, but I think this isn't only a replication issue; it also involves primary performance. We are 100% sure that our primary can tolerate our load (3 years to prove it!). We can't easily downgrade our replica sets to older versions, and we need at least some kind of workaround. I tried downgrading one of those secondaries to 3.4.3 and replication still times out. We can't downgrade the primary since it is being used in production. Thanks. |
| Comment by Ramon Fernandez Marina [ 26/Jul/17 ] |
|
Thanks for the update michaelbrenden, we're really glad to hear 3.4.3 is working well for you. |
| Comment by Michael Brenden [ 25/Jul/17 ] |
|
UPDATE 2017-07-25: Nightly, we have been flogging 3.4.3 with millions of rows imported at once / as quickly as possible. |
| Comment by Spencer Brody (Inactive) [ 15/May/17 ] |
|
In earlier versions of MongoDB we were setting the network timeout for the initial oplog query to 10 seconds, instead of the 60 seconds we had actually intended. This was a bug and has been fixed.

rromano, this sounds like something else is going on. There are two main reasons the oplog query could take a long time: the replicating secondary is severely lagged behind its sync source, or the sync source is overloaded. This ticket has been left open to address the first issue, where a secondary is severely lagged but still has a common point between its oplog and its sync source's, yet finding that common point can take a long time (> 60 seconds). The second issue, as I have said, should be resolved by fixing the performance problems on the sync source.

In your case, however, you are saying that there is no repl lag and the sync source is not loaded. In that case I would not expect the oplog query to ever take longer than 60 seconds. The overall size of the oplogs on either node should not have any effect on the performance of the oplog query. So it sounds like something else must be going on that is causing the issues you are seeing. I would recommend engaging with our customer support staff or posting on the mongodb-user group for assistance. -Spencer |
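To tell the two cases apart before touching any timeout, a quick look at lag and at what the sync source is doing can help; a rough sketch using standard shell helpers of that era (the 30-second threshold is only an illustrative cutoff, nothing here is specific to this ticket):

```
// On any member: report how far each secondary is behind the primary's oplog.
rs.printSlaveReplicationInfo();

// On the sync source: list operations that have been running for 30+ seconds,
// which would point to an overloaded node rather than a lagged secondary.
db.currentOp({ "secs_running": { "$gte": 30 } });
```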
| Comment by Michael Brenden [ 15/May/17 ] |
|
The issue we were seeing was not due to an "overloaded primary". Upgrading from 3.4.2 to 3.4.3 fixed the issue for us (on one secondary; we haven't tested the other secondary yet, which had to be decommissioned because of this issue), but the upgrade apparently shouldn't have fixed the issue. |
| Comment by Robert Romano [ 05/May/17 ] |
|
I'm seeing this same issue when doing a rolling upgrade in a 3-node replica set. There is a large oplog, but there wasn't much lag and it was a quiet period with little to no inserts or queries when we stepped down our primary and upgraded it to the latest version of Mongo (3.4.4). When restarting, it entered RECOVERING and keeps saying: Blacklisting XXXX:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand.

I don't think the primary was overloaded; it had just become the primary. Same for the other secondary. So it seems like if you have a large oplog (just because you need it for other reasons) you can hit this longer query time, and therefore it's not a given that you are overloaded. |
| Comment by Daniel Rupp [ 03/Apr/17 ] |
|
I'd like to point out another cause of this problem, which is what we are experiencing: a slow network and big changes. We have a MongoDB cluster in Hong Kong and a hidden slave in our office in Europe. The bandwidth is very limited; fortunately the server does not have a very high change rate (it is mostly an archive system), but individual changes can be very big.

At the moment we are stuck in a loop where the replication client tries to download the next oplog entry but hits the timeout before it is downloaded completely. It cancels the transfer and starts again. Every minute, never completing. Again and again... Hope this gives you another view on this issue.

Edit: 3.4.3 did fix our problem. |
| Comment by Spencer Brody (Inactive) [ 01/Mar/17 ] |
|
Hi, I have also reviewed all the comments on this ticket, and seeing how many people have been affected by this, I have become convinced that it makes sense to expose this as a configuration option.

I still agree with Thomas that 99% of the time, if an oplog cursor cannot be established within 60 seconds, that indicates the sync source is likely overloaded or experiencing other problems, and that instead of just increasing the timeout on the oplog query it would be better to address the root cause of the slowdown on the sync source. That said, however, I can imagine a legitimate case for wanting to increase the timeout, namely when the syncing node is severely repl lagged, where finding the common point with its sync source's oplog could legitimately take a long time.

Therefore, I am marking this ticket as "3.5 Required", meaning that we will address it at some point during our 3.5 development cycle. I am also marking it for consideration for backport to 3.4: once the change has been made to the master branch, we will see how involved the change was and consider whether it can be safely included in the next stable release of 3.4.

I hope this addresses your concerns. If you have further questions or concerns about this please let me know. Also please feel free to watch this ticket for future updates and to be notified when it has been released into a stable version of MongoDB. -Spencer |
| Comment by Michael Brenden [ 24/Feb/17 ] |
|
Echoing: the problem does not seem to be with an 'overloaded primary'. It is specific to certain secondaries, while other secondaries operate perfectly well. We can cause this problem to happen on demand, so if 10gen needs any testbed, feedback, etc., just say so. |
| Comment by Jonathan Kamens [ 24/Feb/17 ] |
|
That is not at all our experience. In all instances when we have had this problem repeatedly on an ongoing basis, recompiling mongod with a longer oplog query timeout made the problem go away, full stop, and allowed the secondaries to successfully replicate from the primary sufficiently fast to meet our needs.

Oh, for Pete's sake, just make the damn thing a configuration setting so that people who are having this problem can actually do something about it, rather than being forced to wait years for any sort of solution from MongoDB. If there are indeed other underlying causes of this issue that need to be addressed, then great, address them, but in the meantime, give people what they need to treat the symptoms so their secondaries don't keep breaking. This is so obvious that it's simply mind-blowing you haven't done it despite the fact that this issue has been open for almost two years. Sheesh. |
| Comment by Kelsey Schubert [ 21/Feb/17 ] |
|
Thanks for your interest in this improvement. We are currently investigating options to modify the oplog timeout. Please be aware that this log message is typically symptomatic of an overloaded primary; therefore, while increasing the oplog timeout may prevent these messages from being logged, it would likely not resolve the replication lag that is being observed. For MongoDB-related support discussion, I would recommend posting on the mongodb-user group or Stack Overflow with the mongodb tag. A question about how to resolve replication lag involves more discussion and would be best posted on the mongodb-user group. |
| Comment by Michael Brenden [ 08/Feb/17 ] |
|
I pasted a few of the reports/comments above into a new ticket: https://jira.mongodb.org/browse/SERVER-27952 |
| Comment by Michael Brenden [ 08/Feb/17 ] |
|
Shut down all failing secondaries. Did a truncate on a collection having over 100k documents, then immediately a mass document insert (batched 100 documents at a time), probably inserting maybe 500 to 1k documents per second on the PRIMARY. This is when the trouble always begins.

For the first few seconds after the start of the mass inserts (and during the course of the mass inserts, which take several minutes), strong and weak secondaries fall behind no more than 15 seconds and sporadically catch up. However, after about 30 seconds, the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); the strong secondary keeps falling behind by 15 seconds or so and then catching up, repeatedly, and ends up all caught up after a mass insert of up to 2 million documents.

CPU load on the PRIMARY (32 cores) never really changes during this entire time; CPU load on the strong secondaries also never changes much; but CPU load on the weak secondary goes to 4 or higher (4 cores), then settles down after the mass insert on the PRIMARY ends and as failures continue to be logged on the weak secondary.

Errors logged on the weak secondary (repeats every 10 seconds): [replication-2] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 13296 -- target:10.11.22.33:27017 db:local expDate:2017-02-08T01:25:02.627-0500 cmd:{ getMore: 15975151664, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1486464780000|35813, t: 7 }}. Last fetched optime (with hash): { ts: Timestamp 1486533979000|316, t: 7 }[-4870081738431509318]. Restarts remaining: 3 |
| Comment by Vladimir [ 07/Feb/17 ] |
|
@Travis Redman, can you please move this into the bug section to attract more attention? |
| Comment by Michael Brenden [ 07/Feb/17 ] |
|
found this mod to timeout |
| Comment by Michael Brenden [ 07/Feb/17 ] |
|
Yes, that gets my vote, because if this has been going on since 2.6, and if it does routinely stop replication (as it does for us), it would seem a rather huge bug.

Just tried moving (and ln -s'ing) the oplog from the standard mongo data dir to another drive on a troublesome secondary and noticed it was able to keep up for a longer period: its replication did not fail right away within the first few seconds but instead kept up (i.e., replicated properly, with low ~5 to ~10 second lag) for maybe a minute or so before getting swamped and timing out. The secondary never recovers after the timeout. Trying next to move all collection indexes to another drive.

Previously, we blew out the whole replica set and tried a 2 GB oplog on the primary. It made no difference, so we blew it all out again and changed back to a 20 GB oplog (which we use presently). Overall, I have to say, this feels very microsoft-y, like a throwback to the agonizing NT4 way of troubleshooting: change, reboot, change, reboot. |
| Comment by Vladimir [ 07/Feb/17 ] |
|
Maybe we should rename this issue from "Oplog timeout should be configurable" to "replication bug under huge load", and change the category from "Improvement" to "Bug"? |
| Comment by Jonathan Kamens [ 07/Feb/17 ] |
|
> "Problem seems related to upgrade from 3.2 to 3.4"

Nah. We see it in 2.6. It's been around for far longer than 3.2. |
| Comment by Michael Brenden [ 07/Feb/17 ] |
|
Not sure. The problem happens whenever the PRIMARY is given several thousand inserts over a few-second period. Some secondaries pull everything fine and keep up fine, staying up to date within a few seconds. Other secondaries immediately begin to fall behind and never catch up. The problem seems related to the upgrade from 3.2 to 3.4. |
| Comment by Eric Milkie [ 06/Feb/17 ] |
|
It's important to determine why the getMore cursor advancement on the oplog is taking over five seconds on the upstream server, as this should be nearly instantaneous. An analysis of the locking data there should take place. |
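One way to collect that data while the problem is happening, sketched with standard shell commands (the 5-second threshold mirrors the maxTimeMS value in the logs quoted above; the printed fields are ordinary db.currentOp() output fields, not anything specific to this ticket):

```
// Run on the upstream node (the sync source) while a secondary is timing out.
// Shows oplog getMore operations running for 5+ seconds, with their yield and
// lock-wait indicators for the locking analysis suggested here.
db.currentOp({
    "ns": "local.oplog.rs",
    "secs_running": { "$gte": 5 }
}).inprog.forEach(function(op) {
    printjson({
        opid: op.opid,
        secs_running: op.secs_running,
        numYields: op.numYields,
        waitingForLock: op.waitingForLock
    });
});
```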
| Comment by Michael Brenden [ 06/Feb/17 ] |
|
On the primary and some secondaries, all of which have mongo indexes stored on SSDs, the oplog timeout is not occurring; all other secondaries (with mongo indexes stored on spinning media) exhibit the oplog timeout. Not sure why it would matter; just an observation. |
| Comment by Vladimir [ 05/Feb/17 ] |
|
OK, I use Docker in swarm mode with a multi-host encrypted network (the standard setup in this Docker case). I get exactly the same error. |
| Comment by Michael Brenden [ 05/Feb/17 ] |
|
Previously, we discovered too-short timeouts in TCP socket settings that we thought were causing the problem. We resolved the TCP socket timeout issues, but we are still having this problem, although now it is only on some subnets, some of which are routed using openvpn. Upgraded to 3.4.2 with no change / improvement. Still researching. Will report back. Every ten seconds, this kind of error is written to the log on two (of several) secondaries:

These two secondaries are connected via an openvpn tunnel. Several other secondaries, some connected similarly via openvpn and some not, work perfectly well. All machines run 3.4.2. These errors occur while the secondaries are applying bulk inserts made to the primary... or any inserts numbering in the several hundreds of documents. |
| Comment by Vladimir [ 05/Feb/17 ] |
|
I have this problem with version 3.4.1 too. From the log on the secondary, there are very many of these entries:
|
| Comment by Michael Brenden [ 04/Dec/16 ] |
|
No, I meant 6 seconds. Onesy-twosy adds/updates to the master (and thus brief, small oplog entries) cause no problems. Duplicated a 6-row collection: no problem; duplicated a 300-row collection: no problem; after trying to duplicate a 71,000-row collection, the log shows something about index type 3.4.0, and all the secondaries get confused to the point of halting all replication. Investigating this now. |
| Comment by Eric Milkie [ 04/Dec/16 ] |
|
Did you mean to say a 60-second timeout? That is the current value in version 3.4.0. |
| Comment by Michael Brenden [ 03/Dec/16 ] |
|
We are hitting this same issue in production on mongo 3.4.0, while inserting up to 1 million records, batched at 10k records at a time. The oplog query apparently grows slow enough that it blows past the secondaries' pitiful 6-second timeout and aborts. Why isn't this option configurable?? |
| Comment by Ven [ 07/Jul/16 ] |
|
Hi Ramon, we will see if we can get the security approval and provide some logs. Some additional color for recreating the issue:

A/ Our "ingestion" into the cluster includes inserts, updates AND deletes (yes, we are working to rework this part and not do a search-and-delete). In the past, sudden surges of inserts have also triggered similar behavior. Once this happened, with deletes continuing for another 10-15 minutes (we have a monitoring script that stops deletes if replication lag increases beyond a threshold), that was enough for oplog full-scan queries to take a long time, and from there everything went downhill even though all the deletes had stopped and only inserts were happening at a constant clip. |
| Comment by Ramon Fernandez Marina [ 06/Jul/16 ] |
|
ven, sorry you're also running into this. Do you have logs you can share for the primary and at least one secondary node during the high ingestion process (steps 2-5 above)? You can attach them to this ticket or upload them privately here. We were not able to reproduce this behavior when it was first reported, but hopefully your logs will shed some light on this ticket. Thanks, |
| Comment by Ven [ 06/Jul/16 ] |
|
Sorry, first time here... hopefully the following is relevant and helpful. Please excuse if this is not the right place.

We have been hitting this problem continually for the last several months. Our cluster is currently 1+3 on 2.6.12 with the biggest, baddest AWS IO-optimized instances. Our data size is about 3 TB per cluster, ingesting about 220-250K records/min; average record size is 500 bytes, range 200-1000 bytes. As Jonathan mentions above, ours is also a pretty busy cluster with high ingestion rates, and our oplog size matches that. What we find is the following:

1/ Cluster is happy and replicating with lag < 5 seconds

Now unless the ingestion is totally stopped, there is no way to recover. We ended up with a huge number of queries pending on the primary; we had to stop the entire ingestion pipeline and kill all the long-running oplog queries, and then things started back up, but the cluster is 18 hours behind.

Observations/Questions: In the meantime we will follow Jonathan's approach and bump the hard-coded timeout to get more mileage. |
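The lag-threshold check Ven's monitoring script performs can be approximated from rs.status(); a rough sketch, where the 30-second threshold and the "pause deletes" reaction are both illustrative assumptions rather than anything from this ticket:

```
// Approximate replication lag as the gap between the primary's optimeDate and
// the most lagged data-bearing secondary's optimeDate.
var status = rs.status();
var primary = status.members.filter(function(m) { return m.stateStr === "PRIMARY"; })[0];
var secondaries = status.members.filter(function(m) { return m.stateStr === "SECONDARY"; });
var maxLagSecs = Math.max.apply(null, secondaries.map(function(m) {
    return (primary.optimeDate - m.optimeDate) / 1000;
}));
if (maxLagSecs > 30) {
    // Hypothetical reaction: signal the ingestion pipeline to pause deletes.
    print("replication lag " + maxLagSecs + "s exceeds threshold; pausing deletes");
}
```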
| Comment by Jonathan Kamens [ 17/Mar/16 ] |
|
This just bit us in one of our production replica sets. The primary was sufficiently busy, our oplog sufficiently large, and we inserted enough documents in a short enough period of time, that scanning the table to respond to the secondary started taking not only more than 30 seconds, but upwards of 200 seconds. Our secondary stopped syncing because of the receive timeouts and stayed that way until I compiled a custom mongod binary with tcp_timeout set to 600 seconds in oplogreader.h and ran that mongod binary on the secondary, after which it happily started synchronizing with the primary again. |
| Comment by Travis Redman [ 11/Aug/15 ] |
|
I believe it was one of the rocks nodes. Next time I see this pop up I'll grab db.currentOp() for each client IP. |
| Comment by Asya Kamsky [ 06/Aug/15 ] |
|
tredman@fb.com which secondary is the one at that IP? |
| Comment by Travis Redman [ 06/Aug/15 ] |
|
db.currentOp() showing how the oplog queries pile up from the same client (I filtered out the exact IP). The client disconnects, but the query keeps running. It's unclear to me whether this pileup has a significant effect on the performance of subsequent queries, but I do note a lot of yielding and waiting on the lock.
|
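If those orphaned oplog queries need to be cleared while diagnosing, a hedged sketch follows; it assumes the piled-up operations are the local.oplog.rs ops visible in db.currentOp() and that killing them is acceptable on this node (the 300-second threshold is an arbitrary example):

```
// List long-running oplog queries for review; uncomment killOp only after checking
// that the listed operations really are the orphaned cursors described above.
db.currentOp({ "ns": "local.oplog.rs", "secs_running": { "$gte": 300 } }).inprog.forEach(function(op) {
    print("candidate opid " + op.opid + ", secs_running " + op.secs_running + ", client " + op.client);
    // db.killOp(op.opid);
});
```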
| Comment by Travis Redman [ 06/Aug/15 ] |
|
It's happening again, actually. We have a 7-node cluster (+ 2 arbiters): 4 nodes are 2.6.7, including the primary, and three hidden nodes are running a build based on 3.0.4 with mongo+rocksdb. Four nodes (one 2.6.7 MMAP node and the three 3.0.4 rocks nodes) are behind by 13 hours. This is mongodata9 if you can see our MMS data. Seeing similarly slow oplog queries:

Here's build info from the primary.
|
| Comment by Asya Kamsky [ 06/Aug/15 ] |
|
tredman@fb.com can you please include db.serverBuildInfo() from the primary that this happened with? We had an engineer try to reproduce long oplog query times but they weren't able to, and I want to make sure that they are trying with a similar enough setup. |
| Comment by Eric Milkie [ 05/Aug/15 ] |
|
I think I would want to first understand why such queries on the oplog are taking over one thousand seconds. |