[SERVER-30772] Can't get in to Mongo Shell for primary node, it hangs, apps could not connect to it Created: 22/Aug/17 Updated: 09/Oct/17 Resolved: 14/Sep/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Logging, Networking, Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Erin Bao | Assignee: | Mark Agarunov |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We are running a three-member MongoDB 3.2.12 replica set in Docker containers on Redhat Linux 4.1.12-61.1.25.el7uek.x86_64. Several times recently, our apps have suddenly been unable to connect to the set (mark this #1 for reference). When this happened, I logged on to the primary node and could not get into the mongo shell on the primary, although I could still get into the two secondaries. Replication still seemed fine (I checked with db.printSlaveReplicationInfo()). One other strange thing I noticed is that the MongoDB log file stopped being updated on the primary (mark this #2 for reference), while it was fine on the secondaries. This does not seem to be directly related to the main issue above: from recent observations, it can take days for #1 to happen after #2. Could you help? We are at a loss and don't know what to look for. Thanks, Erin |
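Symptom #2 (the log file on the primary no longer being updated) can be watched for with a small check on the file's modification time. The following is only a sketch: the function names and the 10-minute threshold are assumptions made for illustration and are not taken from this ticket or from any MongoDB tooling.

```python
import os
import time


def seconds_since_last_write(path: str) -> float:
    """Return how many seconds have passed since the file was last modified."""
    return time.time() - os.path.getmtime(path)


def is_stale(path: str, max_age_seconds: float = 600.0) -> bool:
    """Report whether the file has gone longer than max_age_seconds without a write.

    The 600-second default is an arbitrary illustrative threshold, not a
    MongoDB recommendation.
    """
    return seconds_since_last_write(path) > max_age_seconds
```

Calling `is_stale()` periodically on the mongod log path of each node (wherever the container writes it) would flag a node whose log has stopped growing. A quiet but healthy node could also trip the check, so the threshold would need tuning.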
| Comments |
| Comment by Erin Bao [ 15/Sep/17 ] |
|
Thank you Mark! Thanks for the confirmation. About two weeks ago, we redirected the logs (the MongoDB log and the Docker log) to Graylog (I am not familiar with the technical details), and since then the problem has not happened again. Thank you for your effort on this; I appreciate it! Erin |
| Comment by Mark Agarunov [ 14/Sep/17 ] |
|
Hello erinbaoathub, Taking a deeper look at the data, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources. Thanks, |
| Comment by Erin Bao [ 06/Sep/17 ] |
|
Hi Mark, Where can I find the raw mongodb logs? We run mongodb in Docker, so if the logs appear to have been generated by something else, could that be Docker? In any case, they moved all those logs to something called Graylog, a UI where I can search the logs, but what it shows is still about the same as what I sent you. Surprisingly, since they moved the logs to Graylog on Aug 31st, our system seems to stay up longer (it has been 8 days without rebooting and the nodes are still fine! The issue used to occur every 3-4 days and we had to reboot). I don't know what other logs to look for; having Docker in the picture makes it much harder to find things than running mongodb on a few normal Linux hosts. Thanks, |
| Comment by Mark Agarunov [ 01/Sep/17 ] |
|
Hello erinbaoathub, Thank you for providing this information. Unfortunately we still have not identified the cause of this issue. The provided logs appear to have been generated by a tool other than mongodb. Would it be possible to get the raw logs from mongod for this time period? There look to be some discrepancies between the logs and the diagnostic data, and I suspect the raw logs may contain additional information that is missing from these logs. Thanks, |
| Comment by Erin Bao [ 24/Aug/17 ] |
|
I also attached the logs from node1 and node3, just in case you need them: node1_logAug24.txt and node3_logAug24.txt. After we rebooted node 2, node 3 became our new primary. Thanks again! |
| Comment by Erin Bao [ 24/Aug/17 ] |
|
And it just happened again between 14:10 and 14:40 our time (Chicago) on Aug 24th, which is 19:10 to 19:40 UTC on Aug 24th. I got the diagnostic data from the primary (which is our node 2, db02), and I will attach it as newjiraAug24.tgz. Erin |
| Comment by Erin Bao [ 24/Aug/17 ] |
|
Hi Mark, Thanks for the reply! I am now attaching three files from our three prod nodes. The last time the issue happened was between about 6am and 8am on Aug 21st. I discovered the issue at 7:28am, but it might have started earlier without our knowing. It was on node 1 (db01), which was the primary before the issue popped up. We rebooted the box around 7:40am that day. You should see the diagnostic data covering that time. Please be aware that I am talking about Chicago time; add 5 hours to get the UTC time in the logs. After the reboot, our current primary is node 2 (db02). You can see that its log file has not been updated since 12:10 on Aug 21st. Please let me know if you find anything we can improve, thanks a lot! Erin |
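The Chicago-to-UTC conversion used in this comment can be sketched as follows. This is a minimal illustration that assumes US Central Daylight Time (UTC-5) was in effect, as it was in August 2017; the `chicago_to_utc` helper is a hypothetical name introduced here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central Daylight Time (UTC-5), which applied in Chicago in August 2017.
CDT = timezone(timedelta(hours=-5), name="CDT")


def chicago_to_utc(local: datetime) -> datetime:
    """Treat a naive datetime as Chicago (CDT) wall-clock time and return it in UTC."""
    return local.replace(tzinfo=CDT).astimezone(timezone.utc)


if __name__ == "__main__":
    # Time the issue was discovered, and the approximate reboot time, on Aug 21st.
    print(chicago_to_utc(datetime(2017, 8, 21, 7, 28)).isoformat())  # 2017-08-21T12:28:00+00:00
    print(chicago_to_utc(datetime(2017, 8, 21, 7, 40)).isoformat())  # 2017-08-21T12:40:00+00:00
```

A fixed offset is used to keep the sketch dependency-free; outside daylight saving time the Chicago offset is UTC-6, so a time zone database lookup would be the safer choice for dates across the year.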
| Comment by Mark Agarunov [ 24/Aug/17 ] |
|
Hello erinbaoathub, Thank you for the report. To get some better insight into what may be causing this, could you please provide:
This should provide some more information to better diagnose this issue. Thanks, |