[SERVER-38084] MongoDB hangs when a part of a replica set Created: 12/Nov/18 Updated: 14/Nov/18 Resolved: 13/Nov/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Hannah [X] | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We have been running a single instance of Mongo with our application for a number of years without any issues at all. We have now set up a replica set with 3 members: 1 primary, 1 secondary and an arbiter. After a couple of hours our application simply stops working with the database, i.e. it hangs. Below is an extract from the database log file on the primary server. In this log the activity seems to stop at 14:09:15 for our application (running on 192.168.45.105); address 192.168.45.106 is the secondary server. All we get after that is successful authentication and nothing else. If we restart our application, it stops on the first database call. If Mongo is restarted on the primary and secondary servers and our application is then restarted, it works again for another couple of hours and the cycle repeats.
|
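For anyone reproducing this, a quick first check during such a hang is each member's replication state as seen from the primary. A minimal sketch, assuming the primary address from the description above and the default port 27017:

```bash
# Print each replica set member's name, state, and last-applied optime.
# The host is taken from the description; the port is an assumption.
mongo --host 192.168.45.105:27017 --quiet --eval '
  rs.status().members.forEach(function (m) {
    print(m.name, m.stateStr, m.optimeDate);
  });'
```

A secondary stuck in STARTUP2 while an initial sync runs would be consistent with the diagnosis reached in the comments below.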
| Comments |
| Comment by Bruce Lucas (Inactive) [ 14/Nov/18 ] |
|
Hi Ian, That documentation link is for version 3.2. As of version 3.6 (https://docs.mongodb.com/v3.6/reference/configuration-options/#replication.enableMajorityReadConcern) enableMajorityReadConcern defaults to true. Bruce |
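For reference, a minimal sketch of making the setting explicit at startup, given that 3.6 defaults it to true; the config file path is a placeholder, and in the 3.6 YAML config the equivalent key is replication.enableMajorityReadConcern:

```bash
# Restart mongod with majority read concern disabled (3.6.x syntax).
# /etc/mongod.conf is an assumption; keep the deployment's other options as-is.
mongod --config /etc/mongod.conf --enableMajorityReadConcern false
```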
| Comment by Ian Hannah [X] [ 14/Nov/18 ] |
|
I have fixed the issue with the installer by removing elements from the registry. According to the documentation (https://docs.mongodb.com/v3.2/reference/configuration-options/#replication.enableMajorityReadConcern) enableMajorityReadConcern is false by default but I will add it to the configuration file anyway. |
| Comment by Ian Hannah [X] [ 14/Nov/18 ] |
|
Hi Danny, I am having real problems installing 3.6.8. The attached error box keeps coming up. I have completely uninstalled 3.6.2 but I cannot get past this error. Any ideas? Thanks Ian
|
| Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ] |
|
Hello Ian, We started testing MongoDB with read concern majority disabled in 3.6.7 so we recommend upgrading to at least that version if you are going to disable the feature. Disabling that feature is the main way forward for you. Alternatively you can upgrade the arbiter to a data node and you should not experience these problems once all three nodes are healthy. Thanks, Danny |
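A sketch of the arbiter-to-data-node swap Danny suggests, run against the primary. The arbiter's address here is inferred from a log line later in this ticket and may differ, and the re-added member needs an empty dbpath so that it performs a fresh initial sync:

```bash
# Remove the arbiter from the replica set config, then re-add the same host
# as a data-bearing, voting member. Stop the arbiter process and clear its
# dbpath before restarting it as a normal mongod.
mongo --host 192.168.45.105 <<'EOF'
rs.remove("192.168.45.107:27057");
rs.add({ host: "192.168.45.107:27057", priority: 1, votes: 1 });
EOF
```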
| Comment by Ian Hannah [X] [ 13/Nov/18 ] |
|
Hi Danny, Can you please confirm what bug fixes/enhancements there are in 3.6.8 which will help resolve this issue? Thanks Ian |
| Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ] |
|
Hello Ian, Yes, it does appear that you are experiencing heavy cache pressure (the cache is 100% full), most likely due to the read concern majority issue I mentioned earlier. I strongly recommend that you disable the feature by setting --enableMajorityReadConcern to false and upgrade to 3.6.8 whenever possible. You may also wish to add more RAM, but that may not be necessary. As we're fairly certain we have identified the problem, I will now close this ticket. Thank you very much, Danny |
| Comment by Ian Hannah [X] [ 13/Nov/18 ] |
|
Hi Danny,
Sorry about that - files attached.
Thanks Ian
|
| Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ] |
|
Hello Ian, I apologize; I wasn't clear. Could you please provide the diagnostic.data for the Primary node? Thank you, Danny |
| Comment by Ian Hannah [X] [ 13/Nov/18 ] |
|
Morning Danny, It got to about 6% at midnight and then I think that the session terminated. I have attached the log file. Thanks Ian |
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
Hi Danny, This is the latest log statement:

2018-11-12T20:51:10.124+0000 I - [repl writer worker 9] map.childdata collection clone progress: 576093/14165331 4% (documents copied)

So at this rate it is going to take more than 70 hours to synchronise, which is way too long - something has to be very wrong. I have attached the diagnostic data from the last couple of days. Any help would be appreciated. Thanks Ian |
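The 70-hour figure checks out as a back-of-envelope calculation against the two progress lines in this ticket (29975 documents at 17:23:33, quoted in a comment below, and 576093 documents at 20:51:10 here):

```bash
copied=$((576093 - 29975))                                      # 546118 docs
elapsed=$(( (20*3600 + 51*60 + 10) - (17*3600 + 23*60 + 33) )) # 12457 s
rate=$(( copied / elapsed ))                                    # ~43 docs/s
remaining=$((14165331 - 576093))
echo "ETA: $(( remaining / rate / 3600 )) hours"                # ~87 hours
```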
| Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ] |
|
Hello Ian, If you wish, you can provide us with the contents of your diagnostic.data folder and we can take a look at some of the internal metrics. Thank you, Danny |
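For reference, diagnostic.data is the FTDC metrics directory that lives under each node's dbPath. A sketch for bundling it on a Linux-style install (the dbPath below is an assumption; on a Windows deployment like this one, zip the same folder under the configured dbPath instead):

```bash
tar czf diagnostic-data-primary.tar.gz -C /var/lib/mongodb diagnostic.data
```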
| Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ] |
|
Hello Ian, You can attempt to disable read concern "majority" in your cluster by setting --enableMajorityReadConcern to false. This may reduce cache pressure on the Primary until the Secondary is back online. You may also want to investigate increasing the hardware available to the machine, as it may be resource starved. From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag; a question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources. Thank you, Danny |
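A quick way to spot-check the cache pressure Danny mentions, using field names from serverStatus; the host is taken from this ticket and the port is assumed to be the default:

```bash
# Compare bytes in cache against the configured maximum; a cache pinned at
# ~100% of the maximum indicates the pressure described in this ticket.
mongo --host 192.168.45.105 --quiet --eval '
  var c = db.serverStatus().wiredTiger.cache;
  print("in cache :", c["bytes currently in the cache"]);
  print("max      :", c["maximum bytes configured"]);
  print("dirty    :", c["tracked dirty bytes in the cache"]);'
```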
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
Just got another message in the logs: 2018-11-12T17:23:33.172+0000 I - [repl writer worker 10] map.childdata collection clone progress: 29975/14165331 0% (documents copied) |
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
Hi Danny, I have just tried to set up everything again. Attached is the mongo.log file from the secondary server. You will see this line:

2018-11-12T17:02:27.196+0000 I - [repl writer worker 10] map.childdata collection clone progress: 14064/14165331 0% (documents copied)

and there are no subsequent lines with more documents copied. Any ideas why this is the case? I think that this could be the cause of my issue. Thanks Ian |
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
Hi Danny, Let's take a step back. Basically, a mongodump of this collection takes two days, which implies there is an issue somewhere. The initial sync is cloning the collection and is dog slow. So I guess the first question is: why is mongodump (which is doing a similar job to the clone) taking so long? Working out this issue will solve the other one, I think! Thanks Ian |
| Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ] |
|
Hello Ian, If you believe that MongoDB's initial sync is too slow, you can seed the node with data files from another node through something like rsync. However, the source process needs to be stopped in order to ensure that data files are not changed during the process. It will be hard to troubleshoot either the hanging application or the collection dump slowness until you have a healthy replica set. Thank you, Danny |
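A sketch of the seeding approach Danny describes; the paths are placeholders, the addresses come from this ticket, and the source mongod must be shut down cleanly before the copy starts so the data files do not change mid-transfer:

```bash
# Copy the stopped source node's data files to the new member, then start
# mongod there with the same --replSet name; it catches up from the oplog
# instead of running a full initial sync.
rsync -av --delete /var/lib/mongodb/ 192.168.45.106:/var/lib/mongodb/
```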
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
Hi Danny, I understand that it needs to do an initial sync, but the initial sync is taking a very long time, i.e. many, many hours. What is interesting is that a mongodump takes over 2 days to back up this collection, and the database is not that big (approx 300GB). Is there a way that I can speed up the initial sync? Please note that I was having this issue before, and this is the first time that I have seen an initial sync. Thanks Ian |
| Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ] |
|
Hello Ian, By default, MongoDB will only log operations that take longer than 100ms. If you have operations running below that threshold, you will not see them in the log; however, as you pointed out, this could also be evidence of a stall of some kind. It appears that your Secondary is currently in the midst of an initial sync. Until the sync is done, it will be unable to respond to read and write queries. If your application is using a write concern greater than 1, a read concern of "majority" or "linearizable", or a read preference of "secondary", the application will be waiting for the initial sync to finish before it can continue its operations. Ideally, you would replace the arbiter with another data node (once this second node has finished cloning). That way, even if one data node goes down, you can still satisfy the requirements of any of the options I mentioned above. When the initial sync finishes on the Secondary, do you still experience issues with your application? Thank you, Danny |
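Danny's point about blocked operations can be seen directly from the shell: with the secondary still in initial sync and only the arbiter left, a majority write cannot be acknowledged, because an arbiter votes but holds no data. A minimal sketch (host, database, and collection names are placeholders):

```bash
# This insert stalls until wtimeout and then reports a write concern error
# while the secondary is unavailable, mirroring the application hang.
mongo --host 192.168.45.105 --quiet --eval '
  db.getSiblingDB("test").probe.insertOne(
    { ping: new Date() },
    { writeConcern: { w: "majority", wtimeout: 5000 } });'
```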
| Comment by Ian Hannah [X] [ 12/Nov/18 ] |
|
What is interesting is that when the secondary node starts up, it clones a collection and the clone is terribly slow:

2018-11-12T13:29:55.326+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057 , os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }

We also noticed that mongodump is extremely slow on this collection. Could it be related?
|