[SERVER-38084] MongoDB hangs when a part of a replica set Created: 12/Nov/18  Updated: 14/Nov/18  Resolved: 13/Nov/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Hannah [X] Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File error - mongo.png     Zip Archive metrics.zip     Zip Archive metrics.zip     Text File mongo 2.txt     Text File mongo.log    
Operating System: ALL
Participants:

 Description   

We have been running a single instance of Mongo for a number of years now with our application without any issues at all.

We have now set up a replica set with 3 members: 1 primary, 1 secondary and an arbiter.

After a couple of hours our application simply stops working with the database, i.e. it hangs. Below is an extract from the database log file from the primary server. In this log, activity from our application (running on 192.168.45.105) seems to stop at 14:09:15. Address 192.168.45.106 is the secondary server. All we get after that is successful authentication and nothing else. If we restart our application it stops on the first database call. If Mongo is restarted on the primary and secondary servers and our application is then restarted, it works again for another couple of hours and the cycle repeats.

2018-11-09T14:09:14.973+0000 I COMMAND  [conn254] command map.$cmd command: update { update: "items", writeConcern: { w: 1 }, ordered: true, $db: "map" } numYields:0 reslen:119 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 199ms

2018-11-09T14:09:14.991+0000 I COMMAND  [conn260] command map.data command: find { find: "data", filter: { _id: ObjectId('5afdb5c1238164595c201b71') }, limit: 1, singleBatch: true, $db: "map" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:7697 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 116ms

2018-11-09T14:09:15.002+0000 I COMMAND  [conn269] command map.data command: find { find: "data", filter: { FName: "aw pesticide\rainfall\geometry types\subcatchment\cat ditch\projected grid" }, sort: { _id: -1 }, skip: 0, limit: 1, noCursorTimeout: false, $db: "map" } planSummary: IXSCAN { FName: 1 } keysExamined:1 docsExamined:1 hasSortStage:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:2688 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 126ms

2018-11-09T14:09:24.936+0000 I NETWORK  [listener] connection accepted from 192.168.45.106:49579 #328 (70 connections now open)

2018-11-09T14:09:24.937+0000 I NETWORK  [conn328] received client metadata from 192.168.45.106:49579 conn: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: { type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } }

2018-11-09T14:09:24.943+0000 I ACCESS   [conn328] Successfully authenticated as principal __system on local

2018-11-09T14:09:33.290+0000 I NETWORK  [listener] connection accepted from 192.168.45.105:60568 #329 (71 connections now open)

2018-11-09T14:09:33.292+0000 I NETWORK  [conn329] received client metadata from 192.168.45.105:60568 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }

2018-11-09T14:09:33.428+0000 I ACCESS   [conn329] Successfully authenticated as principal menadmin on admin

2018-11-09T14:10:01.028+0000 I NETWORK  [listener] connection accepted from 192.168.45.105:60570 #330 (72 connections now open)

2018-11-09T14:10:01.029+0000 I NETWORK  [conn330] received client metadata from 192.168.45.105:60570 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }

2018-11-09T14:10:01.201+0000 I ACCESS   [conn330] Successfully authenticated as principal menadmin on admin

2018-11-09T14:10:09.791+0000 I NETWORK  [listener] connection accepted from 192.168.45.105:60571 #331 (73 connections now open)

2018-11-09T14:10:09.793+0000 I NETWORK  [conn331] received client metadata from 192.168.45.105:60571 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }

2018-11-09T14:10:09.922+0000 I ACCESS   [conn331] Successfully authenticated as principal menadmin on admin

2018-11-09T14:10:29.935+0000 I NETWORK  [listener] connection accepted from 192.168.45.106:49580 #332 (74 connections now open)

2018-11-09T14:10:29.936+0000 I NETWORK  [conn332] received client metadata from 192.168.45.106:49580 conn: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: { type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } }

2018-11-09T14:10:29.941+0000 I ACCESS   [conn332] Successfully authenticated as principal __system on local

2018-11-09T14:11:34.935+0000 I NETWORK  [listener] connection accepted from 192.168.45.106:49581 #333 (75 connections now open)

2018-11-09T14:11:34.936+0000 I NETWORK  [conn333] received client metadata from 192.168.45.106:49581 conn: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.6.2" }, os: { type: "Windows", name: "Microsoft Windows Server 2012 R2", architecture: "x86_64", version: "6.3 (build 9600)" } }

2018-11-09T14:11:34.942+0000 I ACCESS   [conn333] Successfully authenticated as principal __system on local

2018-11-09T14:11:51.225+0000 I NETWORK  [listener] connection accepted from 192.168.45.105:60577 #334 (76 connections now open)

2018-11-09T14:11:51.226+0000 I NETWORK  [conn334] received client metadata from 192.168.45.105:60577 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }

2018-11-09T14:11:51.368+0000 I ACCESS   [conn334] Successfully authenticated as principal menadmin on admin

2018-11-09T14:39:30.813+0000 I NETWORK  [listener] connection accepted from 192.168.45.105:60671 #335 (77 connections now open)

2018-11-09T14:39:30.814+0000 I NETWORK  [conn335] received client metadata from 192.168.45.105:60671 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }
 
2018-11-09T14:39:30.951+0000 I ACCESS   [conn335] Successfully authenticated as principal menadmin on admin
 
2018-11-09T16:13:03.502+0000 I CONTROL  [serviceShutdown] got SERVICE_CONTROL_STOP request from Windows Service Control Manager, will terminate after current cmd ends

 



 Comments   
Comment by Bruce Lucas (Inactive) [ 14/Nov/18 ]

Hi Ian,

That documentation link is for version 3.2. As of version 3.6 (https://docs.mongodb.com/v3.6/reference/configuration-options/#replication.enableMajorityReadConcern) enableMajorityReadConcern defaults to true.

Bruce

Comment by Ian Hannah [X] [ 14/Nov/18 ]

I have fixed the issue with the installer by removing elements from the registry.

According to the documentation (https://docs.mongodb.com/v3.2/reference/configuration-options/#replication.enableMajorityReadConcern) enableMajorityReadConcern is false by default but I will add it to the configuration file anyway.
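For illustration, in a YAML mongod configuration file that entry would sit under the replication section, something like the following (the replica set name shown is just a placeholder):

replication:
  replSetName: rs0                     # placeholder; use your actual replica set name
  enableMajorityReadConcern: false     # disable majority read concern on this node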

Comment by Ian Hannah [X] [ 14/Nov/18 ]

Hi Danny,

I am having real problems installing 3.6.8. The attached error box keeps coming up.

I have completely uninstalled 3.6.2 but I cannot get past this error. Any ideas?

Thanks

Ian

Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ]

Hello Ian,

We started testing MongoDB with read concern majority disabled in 3.6.7, so we recommend upgrading to at least that version if you are going to disable the feature. Disabling that feature is the main way forward for you. Alternatively, you can upgrade the arbiter to a data node, in which case you should not experience these problems once all three nodes are healthy.
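As a rough sketch of the second option, run from the mongo shell connected to the primary (the host names and port below are placeholders, and the new member will need to complete its own initial sync before the set is fully healthy):

rs.remove("arbiter-host.example:27017")      // remove the existing arbiter from the replica set config
rs.add("new-data-node.example:27017")        // add a data-bearing member in its place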

Thanks,

Danny

Comment by Ian Hannah [X] [ 13/Nov/18 ]

Hi Danny,

Can you please confirm what bug fixes/enhancements there are in 3.6.8 which will help resolve this issue?

Thanks

Ian

Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ]

Hello Ian,

Yes, it does appear that you are experiencing heavy cache pressure (the cache is 100% full), most likely due to the read concern majority issue I mentioned earlier. I strongly recommend that you disable the feature by setting --enableMajorityReadConcern to false, and that you upgrade to 3.6.8 whenever possible. You may also wish to upgrade your RAM, but that may not be necessary.
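For illustration, if mongod is started from the command line the flag would be passed like this; the config file path is a placeholder, the same setting can instead go in the configuration file as replication.enableMajorityReadConcern, and it must be applied to each data-bearing node:

rem config file path below is a placeholder
mongod --config "C:\path\to\mongod.cfg" --enableMajorityReadConcern false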

As we're fairly certain we have identified the problem, I will now close this ticket.

Thank you very much,

Danny

Comment by Ian Hannah [X] [ 13/Nov/18 ]

Hi Danny,

 

Sorry about that - files attached.

 

Thanks

Ian

 

metrics.zip

Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ]

Hello Ian,

I apologize; I wasn't clear. Could you please provide the diagnostic.data for the Primary node?

Thank you,

Danny

Comment by Ian Hannah [X] [ 13/Nov/18 ]

Morning Danny,

It got to about 6% at midnight and then I think that the session terminated.

I have attached the log file.

Thanks

Ian

mongo 2.txt

Comment by Ian Hannah [X] [ 12/Nov/18 ]

Hi Danny,

This is the latest log statement:

2018-11-12T20:51:10.124+0000 I - [repl writer worker 9] map.childdata collection clone progress: 576093/14165331 4% (documents copied)

So at this rate it is going to take more than 70 hours to synchronise, which is way too long - something has to be very wrong.

I have attached the diagnostic data from the last couple of days. Any help would be appreciated.

Thanks

Ian

metrics.zip

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Ian,

If you wish, you can provide us with the contents of your diagnostic.data folder and we can take a look at some of the internal metrics.

Thank you,

Danny

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Ian,

You can attempt to disable read concern "majority" in your cluster by setting --enableMajorityReadConcern to false. This may reduce cache pressure on the Primary until the Secondary is back online. You may also want to investigate increasing the hardware available to the machine, as it may be resource-starved.

From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

See also our Technical Support page for additional support resources.

Thank you,

Danny

Comment by Ian Hannah [X] [ 12/Nov/18 ]

Just got another message in the logs:

2018-11-12T17:23:33.172+0000 I - [repl writer worker 10] map.childdata collection clone progress: 29975/14165331 0% (documents copied)

Comment by Ian Hannah [X] [ 12/Nov/18 ]

Hi Danny,

I have just tried to set up everything again.

Attached is the mongo.log file from the secondary server. 

You will see this line:

2018-11-12T17:02:27.196+0000 I - [repl writer worker 10] map.childdata collection clone progress: 14064/14165331 0% (documents copied)

and there are no subsequent lines with more documents copied. 

Any ideas why this is the case? I think that this could be the cause of my issue.

Thanks

Ian

mongo.log

Comment by Ian Hannah [X] [ 12/Nov/18 ]

Hi Danny,

Let's take a step back. Basically, a mongodump of this collection takes two days, which implies there is an issue, right? The initial sync is cloning the collection and is dog slow.

So I guess the first question is: why is mongodump (which is doing a similar job to the clone) taking so long? Working out this issue will solve the other one, I think!

Thanks

Ian

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Ian,

If you believe that MongoDB's initial sync is too slow, you can seed the node with data files from another node through something like rsync. However, the source process needs to be stopped in order to ensure that data files are not changed during the process.
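A rough sketch of that procedure, with the service name, paths and destination host as placeholders, and with robocopy substituted for rsync since these servers run Windows:

rem stop mongod on the source node so its data files stop changing (service name is a placeholder)
net stop MongoDB

rem mirror the whole dbPath to the new node; the destination mongod must not be running yet
robocopy "C:\data\db" "\\newnode\c$\data\db" /MIR /COPYALL

rem restart the source, then start mongod on the new node with the same replica set name
net start MongoDB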

It will be hard to troubleshoot either the hanging application or the collection dump slowness until you have a healthy replica set.

Thank you,

Danny

Comment by Ian Hannah [X] [ 12/Nov/18 ]

Hi Danny,

I understand that it needs to do an initial sync, but the initial sync is taking a very long time, i.e. many, many hours.

What is interesting is that the mongodump takes over 2 days to back up this collection, and the database is not that big (approx. 300GB).

Is there a way that I can speed up the initial sync? 

Please note that I was having this issue before and this is the first time that I have seen an initial sync.

Thanks

Ian

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Ian,

By default, MongoDB will only log operations that take longer than 100ms. If you have operations running below that threshold, you would not see them in the log. However, as you pointed out, this could also be evidence of a stall of some kind.
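As an aside, that threshold is configurable; a minimal sketch, run from the mongo shell against the node in question (the 50ms value is just an example):

// level 0 keeps the profiler off; the second argument lowers the slow-operation log threshold to 50ms
db.setProfilingLevel(0, 50)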

It appears that your Secondary is currently in the midst of an initial sync. Until the sync is done, it will be unable to respond to read and write queries. If your application is using a write concern greater than 1, a read concern of "majority" or "linearizable", or a read preference of "secondary", the application will be waiting for the initial sync to finish before it can continue its operations.
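For illustration only (host names, credentials and the replica set name below are placeholders, and read concern already defaults to "local" unless the application sets it explicitly), a connection string that keeps writes at w:1 and reads on the primary would look something like:

mongodb://menadmin:<password>@primary-host:27017,secondary-host:27017/?replicaSet=rs0&w=1&readPreference=primary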

Ideally, you would replace the arbiter with another data node (once this secondary has finished cloning). That way, even if one data node goes down, you can still satisfy the requirements of any of the above items I mentioned.

When the initial sync finishes on the Secondary, do you still experience issues with your application?

Thank you,

Danny

Comment by Ian Hannah [X] [ 12/Nov/18 ]

What is interesting is that on the secondary node, when it starts up, it is cloning a collection, and this is terribly slow:

2018-11-12T13:29:55.326+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057
2018-11-12T13:29:55.333+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 7ms (3 connections now open to 192.168.45.107:27057)
2018-11-12T13:30:55.333+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open
2018-11-12T13:31:45.279+0000 I - [repl writer worker 10] map.childdata collection clone progress: 41966/13942644 0% (documents copied)
2018-11-12T13:36:02.449+0000 I - [repl writer worker 10] map.childdata collection clone progress: 55261/13942644 0% (documents copied)
2018-11-12T13:40:51.232+0000 I - [repl writer worker 10] map.childdata collection clone progress: 68668/13942644 0% (documents copied)
2018-11-12T13:44:38.874+0000 I NETWORK [listener] connection accepted from 192.168.45.105:54500 #4 (3 connections now open)
2018-11-12T13:44:38.995+0000 I NETWORK [conn4] received client metadata from 192.168.45.105:54500 conn: { driver: { name: "mongo-csharp-driver", version: "2.4.4.0" }, os: { type: "Windows", name: "Microsoft Windows", architecture: "x86_64" }, platform: ".NET Framework 4.7.3163.0" }
2018-11-12T13:44:39.609+0000 I ACCESS [conn4] Successfully authenticated as principal menadmin on admin
2018-11-12T13:45:15.704+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057
2018-11-12T13:45:15.712+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 8ms (3 connections now open to 192.168.45.107:27057)
2018-11-12T13:46:15.712+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open
2018-11-12T13:46:42.251+0000 I - [repl writer worker 10] map.childdata collection clone progress: 88246/13942644 0% (documents copied)
2018-11-12T13:48:54.430+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057
2018-11-12T13:48:54.439+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 10ms (3 connections now open to 192.168.45.107:27057)
2018-11-12T13:50:07.260+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open
2018-11-12T13:51:10.199+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057
2018-11-12T13:51:10.205+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 6ms (3 connections now open to 192.168.45.107:27057)
2018-11-12T13:53:05.176+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.45.107:27057 because the pool meets constraints; 2 connections to that host remain open
2018-11-12T13:53:57.616+0000 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.45.107:27057
2018-11-12T13:53:57.623+0000 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.45.107:27057, took 8ms (3 connections now open to 192.168.45.107:27057)

We also noticed that mongodump is extremely slow on this collection. Could it be related?

 
