[SERVER-41451] Re-starting a secondary database in a replica set generates NetworkInterfaceExceededTimeLimit errors Created: 02/Jun/19 Updated: 27/Oct/23 Resolved: 24/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.8 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Ian Hannah | Assignee: | Dmitry Agranat |
| Resolution: | Community Answered | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Participants: |
| Description |
|
We have a replica set, and the secondary and primary are in sync and all is good. After 7pm each evening we stop the secondary database instance, back up the server and then start it again. The secondary database instance then eventually catches up. This has worked for the last few months without any issues. Now, when the secondary database instance is restarted, we get network timeout messages over and over again and the secondary gets further and further behind. We have no idea what causes these timeouts or how the timeout can be increased. Thanks Ian [^mongo log 28-05-2019.txt] |
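A quick way to gauge how far the restarted secondary is behind, and whether the nightly outage even fits inside the oplog, is from the mongo shell on the primary; a minimal sketch, assuming the 3.6 shell helpers:

```js
// Run against the primary in the 3.6 mongo shell.

// For each secondary: how far behind the primary's newest oplog entry it is.
rs.printSlaveReplicationInfo()

// Configured oplog size and the time span ("log length start to end") it
// covers. If the nightly outage plus catch-up time exceeds this span, the
// secondary will go stale instead of catching up.
rs.printReplicationInfo()
```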
| Comments |
| Comment by Ian Hannah [ 19/Mar/20 ] | ||
|
@here We have done some further extensive tests, and we ONLY get the network errors once Mongo is lagging behind, so this is clearly a Mongo issue. Once it gets behind, it cannot keep up. | ||
| Comment by Ian Hannah [ 12/Mar/20 ] | ||
|
How do you know that these network errors are not caused by Mongo? How can I diagnose what Mongo is trying to do? What is the current timeout, and can it be increased? You say that this may indicate inefficient replication settings. A small oplog window would not cause network errors, would it? You made a suggestion about the read concern that you thought would fix it, so what was the reason behind this suggestion? | ||
| Comment by Dmitry Agranat [ 11/Mar/20 ] | ||
|
Based on your last comment:
These inconsistent and sporadic network errors might indicate some low-level network issues. Alternatively, this might indicate inefficient replication settings, for example an oplog window that is too small because a lot of oplog data is generated per hour. The SERVER project is for bugs and feature suggestions for the MongoDB server. As this ticket does not appear to be a bug, I will now close it. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Community Forums (mongodb-user group) or on Stack Overflow with the mongodb tag. Regards, | ||
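To check the oplog-window hypothesis concretely, the window can be measured and, if needed, the oplog resized online from the shell; a sketch assuming 3.6 with WiredTiger, where the 16000 MB value is purely illustrative:

```js
// On the primary: how many hours of writes the oplog currently holds
// and how large it is ("log length start to end").
rs.printReplicationInfo()

// If that window is shorter than the backup outage plus catch-up time,
// the oplog can be resized online on 3.6/WiredTiger. The size is in
// megabytes and the value here is only an example.
db.adminCommand({ replSetResizeOplog: 1, size: 16000 })
```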
| Comment by Ian Hannah [ 09/Mar/20 ] | ||
|
@here I am still waiting for some response to my last message. enableMajorityReadConcern is already false, and we get these errors intermittently, it seems. | ||
| Comment by Ian Hannah [ 28/Feb/20 ] | ||
|
Some further information that may help. When we restart the server (after backup), most of the time the logs show these network errors. Occasionally when we restart the server we do not get these network errors, and the secondary works correctly and catches up. When we get the network errors it does not catch up. Is there anything special that needs to be done when starting the server to prevent these network errors? | ||
| Comment by Ian Hannah [ 27/Feb/20 ] | ||
|
As a result of these errors the secondary is getting further and further behind and cannot catch up. Eventually it falls outside the oplog window and becomes too stale. | ||
| Comment by Ian Hannah [ 27/Feb/20 ] | ||
|
We are constantly getting this error in the log: 2020-02-27T09:27:14.631+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.45.241:27057 due to bad connection status; 1 connections to that host remain open | ||
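To see whether the heartbeat channel reports the same trouble as these [NetworkInterfaceASIO-RS-0] messages, the member health and ping times can be dumped from rs.status(); a sketch assuming the 3.6 shell (ping and heartbeat fields are not reported for the member you are connected to):

```js
// Print health, round-trip ping and the last heartbeat message for every
// member, to correlate with the "bad connection status" log lines.
rs.status().members.forEach(function (m) {
  print(m.name + "  state=" + m.stateStr +
        "  health=" + m.health +
        "  pingMs=" + m.pingMs +
        "  lastHeartbeatMessage=" + tojson(m.lastHeartbeatMessage));
});
```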
| Comment by Ian Hannah [ 27/Feb/20 ] | ||
|
We cannot have a PSS architecture - we do not have the capacity, so we have to have a PSA architecture. This is the configuration file from one of the servers: systemLog: So we already have enableMajorityReadConcern set to false and we still have the same issue. Can you please advise on what we should do?
| ||
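One way to confirm that the running mongod actually picked up that setting (and was restarted after the config change) is to read the parsed configuration back from the server; a sketch assuming the option lives under the replication section of the config file:

```js
// getCmdLineOpts returns both the raw argv and the parsed config file;
// the replication block should show enableMajorityReadConcern: false.
var opts = db.adminCommand({ getCmdLineOpts: 1 });
printjson(opts.parsed.replication);
```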
| Comment by Dmitry Agranat [ 24/Feb/20 ] | ||
|
We do not recommend a PSA architecture under the default read concern majority configuration. You can either: convert the arbiter into a data-bearing node so that the deployment becomes PSS, or disable read concern majority by setting enableMajorityReadConcern to false.
I will go ahead and close this ticket, but if you are still experiencing issues after implementing either of the above recommendations, please reopen and upload the following data via the provided secure uploader: the mongod log files and an archive of the $dbpath/diagnostic.data directory from both the Primary and the Secondary, covering the time of the incident.
| ||
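For reference, a deployment's membership can be confirmed from the shell; in a PSA set exactly one member should report arbiterOnly: true. A minimal sketch assuming the 3.6 shell:

```js
// List each member's role and voting configuration from the replica set
// configuration document.
rs.conf().members.forEach(function (m) {
  print(m.host + "  arbiterOnly=" + (m.arbiterOnly || false) +
        "  priority=" + m.priority + "  votes=" + m.votes);
});
```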
| Comment by Ian Hannah [ 20/Feb/20 ] | ||
|
When I run this: I get "method is not defined". Is that the correct command? We have a primary, a secondary and an arbiter in our configuration. | ||
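The "method is not defined" error usually means replSetGetStatus was invoked as a bare shell function; it is a database command rather than a shell method, so either of the following forms should work in the 3.6 shell:

```js
// Run the command directly against the admin database...
db.adminCommand({ replSetGetStatus: 1 })

// ...or use the shell helper, which wraps the same command.
rs.status()
```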
| Comment by Dmitry Agranat [ 12/Feb/20 ] | ||
|
I had a look at the provided data. There are a few issues with the configuration of this deployment, as well as an unclear member state during the reported event. For example, I can see this in the log:
During the time you stopped your secondary to perform a backup, the other member of the replica set was also down (so 2 out of 3 members were down). Could you please clarify your replica set deployment and configuration by providing the output of the replSetGetStatus command? | ||
| Comment by Ian Hannah [ 06/Feb/20 ] | ||
|
@Daniel Hatcher. The logs are uploaded. The secondary did eventually catch up but it took many hours. I believe that the network errors shown in the logs are causing the lag issues as this is what we saw before. We will perform the process again while you are looking into it. | ||
| Comment by Danny Hatcher (Inactive) [ 04/Feb/20 ] | ||
|
JIRA can sometimes be wonky with the @ system so no worries. Can you upload all the files to our Secure Uploader? Only MongoDB engineers will be able to access the contents. | ||
| Comment by Ian Hannah [ 04/Feb/20 ] | ||
|
Daniel Hatcher - I cannot work out how to insert your name in a comment. @ does not seem to work! | ||
| Comment by Ian Hannah [ 04/Feb/20 ] | ||
|
daniel.hatcher@mongodb.com I have the logs and diagnostic data from both servers but the zipped file for the Primary is quite large. Let me know where you want me to put the log files. | ||
| Comment by Danny Hatcher (Inactive) [ 08/Oct/19 ] | ||
|
ihannah@meniscus.co.uk, because we're not actively investigating this bug, I'm going to close it. However, please just leave a follow-up comment if you get a chance to test a new version with replication and we can easily re-open the ticket and continue investigating. | ||
| Comment by Ian Hannah [ 08/Oct/19 ] | ||
|
Hi Daniel, We are using version 3.6.13 but we have not set up replication again because we are currently looking into more pressing issues. Can you please keep this ticket open? I am hoping that we can set up replication again shortly. Thanks Ian | ||
| Comment by Danny Hatcher (Inactive) [ 01/Oct/19 ] | ||
|
ihannah@meniscus.co.uk, have you been able to check a later version of 3.6? | ||
| Comment by Ian Hannah [ 05/Sep/19 ] | ||
|
We have tried 3.6.13 in isolation but not in a replica set - we have had other issues to resolve. I am hoping that we can try this over the next week or two and then I will get back to you. Please keep this ticket open for the time being. | ||
| Comment by Danny Hatcher (Inactive) [ 04/Sep/19 ] | ||
|
ihannah@meniscus.co.uk have you had a chance to test 3.6.13? | ||
| Comment by Ian Hannah [ 30/Jul/19 ] | ||
|
Hi Daniel, The test system does not have replication configured. We have upgraded Mongo on the test system and it does not seem to have caused any issues, so I am going to install on live tonight and then configure replication next week. Thanks Ian | ||
| Comment by Danny Hatcher (Inactive) [ 23/Jul/19 ] | ||
|
I can think of reasons why the network timeouts would occur, but none of them match the scenario of everything working fine until the restart and then breaking afterwards. I'm hoping that 3.6.13 will either fix the problem or give us some better diagnostics to troubleshoot it. You mentioned that you have a live system and a test system. Have you ever seen this problem on the test system, or is it only on the live system? Have you been able to reproduce the problem in any environment other than the one in which it's occurring? Maybe there's an issue with the underlying infrastructure. | ||
| Comment by Ian Hannah [ 22/Jul/19 ] | ||
|
Hi Daniel, 1. Nothing has changed hardware-wise. As I mentioned, the replication works well when we have copied the db over to the secondary and then run the system. These issues come into play when the secondary has been shut down for a while. Why would we suddenly get network timeouts when the secondary comes back online when it has been working fine up until this point? What would cause network timeouts? I am confused why network timeouts occur when replication restarts. 2. It will take me a bit of time to get 3.6.13 on the live system. I will have to install it on the test system first and then onto live, so this might take a week or two. Thanks Ian | ||
| Comment by Danny Hatcher (Inactive) [ 19/Jul/19 ] | ||
|
Hello Ian, Thanks for uploading the diagnostics from the Secondary. Importantly, it tells us that the Secondary is replicating after it restarts. However, due to the network timeouts, the replication does not proceed fast enough, so the node eventually reaches a stale state. Based on this information, I have a few follow-up questions. 1. Has anything changed with the hardware or network around the time this started? 2. Could you upgrade to 3.6.13 and test whether the problem still occurs? | ||
| Comment by Ian Hannah [ 17/Jul/19 ] | ||
|
Hi, Apologies for that - I am not sure what happened there but I have uploaded the correct diagnostics for the secondary now. So when we start replication we copy the main db to the secondary db so that all is in sync to start with. We have replication running for a few days and the secondary is no more than 1-2 seconds behind. Then we turn off the secondary to back it up and then a couple of hours later we bring it online again. This is when the secondary never catches up and the network issues appear. | ||
| Comment by Eric Sedor [ 16/Jul/19 ] | ||
|
Hi ihannah@meniscus.co.uk, I am sorry if I have not been clear: the inability of a Secondary to catch up after being down for multiple hours is not necessarily the result of a bug.
| ||
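A concrete way to tell "slow but catching up" apart from "already stale" is to compare the newest oplog entry the secondary has applied with the oldest entry the primary still retains; a sketch assuming the 3.6 shell (reading the local database on a secondary needs slaveOk):

```js
// On the primary: the oldest oplog entry that is still available.
var primaryOldest = db.getSiblingDB("local").oplog.rs
  .find({}, { ts: 1 }).sort({ $natural: 1 }).limit(1).next();
printjson(primaryOldest);

// On the lagging secondary (after rs.slaveOk()): the newest entry applied.
rs.slaveOk();
var secondaryNewest = db.getSiblingDB("local").oplog.rs
  .find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1).next();
printjson(secondaryNewest);

// If the secondary's newest ts is older than the primary's oldest ts, the
// member is stale and needs a resync rather than a catch-up.
```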
| Comment by Ian Hannah [ 15/Jul/19 ] | ||
|
Hi Joseph/Eric, Have there been any developments on this? I am very keen to resolve this issue as currently we do not have replication working. Thanks Ian | ||
| Comment by Ian Hannah [ 11/Jul/19 ] | ||
|
Hi Joseph, The latest files that I have attached have 09072019 in the name. The other two files were from the previous dump when I raised this ticket. So ignore the May ones. Does that make sense? Thanks Ian | ||
| Comment by Eric Sedor [ 10/Jul/19 ] | ||
|
Can you double-check and provide the diagnostic data for the secondary node? It looks like both [^diagnostics data - secondary.zip] and [^diagnostics data - secondary 090719.zip] only contain data from 5/25 to 5/29 | ||
| Comment by Ian Hannah [ 09/Jul/19 ] | ||
|
Hi Eric, Hopefully you have everything that you need. Please let me know if you need anything else. Thanks Ian | ||
| Comment by Ian Hannah [ 09/Jul/19 ] | ||
|
Hi Eric, The same thing happened. Replication was all working (the secondary was 0-1 seconds behind). We shut down the secondary last night for 2 hours, started it up again, and it is lagging more and more. I have attached the log files for you. Thanks Ian | ||
| Comment by Ian Hannah [ 01/Jul/19 ] | ||
|
Hi Eric, We set up replication last week and it is all working. It started going wrong last time when we backed up the secondary. We are going to try this within the next day or two, so I will keep you posted. Thanks Ian | ||
| Comment by Eric Sedor [ 28/Jun/19 ] | ||
|
Hi ihannah@meniscus.co.uk, I wanted to follow up to see if you've experienced additional incidents or if you've had a chance to perform another test. Eric | ||
| Comment by Ian Hannah [ 12/Jun/19 ] | ||
|
Hi Eric, We will need to configure the replication again to get the logs. Unfortunately my colleague is away this week but hopefully we can do this early next week and then I can get you the logs. Thanks Ian | ||
| Comment by Eric Sedor [ 11/Jun/19 ] | ||
|
We can confirm what you are seeing in terms of the Secondary getting further and further behind. When this happens, it is possible for the queries the Secondary issues to the Primary for replication to time out. This is not necessarily a bug and could be happening because the Secondary had been stopped for too long. To determine whether this failure to catch up is the result of a bug, we do need matching diagnostic data and log files for both the Primary and the Secondary for an incident. You can submit these to this secure upload portal. Files uploaded here are only visible to MongoDB employees and will be deleted after some time. Thanks in advance! | ||
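While capturing a matching incident, the replication component's log verbosity can also be raised temporarily so timeouts and sync-source decisions are logged in more detail; the levels below are illustrative:

```js
// More detail from the replication component (sync source selection,
// oplog fetcher errors) around the next backup/restart cycle...
db.setLogLevel(2, "replication");

// ...and back to the default once the matching logs have been collected.
db.setLogLevel(0, "replication");
```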
| Comment by Ian Hannah [ 10/Jun/19 ] | ||
|
Hi Eric, I have only managed to get the logs for the secondary - the primary no longer has information going back to the end of May. Let me know if you can see anything in these logs. If necessary we'll have to go through the whole process again. | ||
| Comment by Eric Sedor [ 06/Jun/19 ] | ||
|
For both the Primary and the Secondary, can you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Thanks in advance! |