[SERVER-67936] Server stuck when systemctl autorestart after mongo crash Created: 11/Jul/22  Updated: 26/Oct/22  Resolved: 26/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: moroine bentefrit Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Hi,

I'm running MongoDB 4.4 on Ubuntu 20.

I've created the following file:

# /etc/systemd/system/mongod.service.d/always_restart.conf
[Service]
Restart=always
RestartSec=60
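
For a drop-in like this to take effect, systemd has to reload its unit files; `systemctl cat mongod` then shows the merged unit so the override can be confirmed:

sudo systemctl daemon-reload
sudo systemctl restart mongod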

MongoDB runs on a server alongside other services, and I suspect an out-of-memory error. Here are the logs from journalctl:

Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]: {"t":{"$date":"2022-07-10T02:22:40.381+00:00"},"s":"F",  "c":"CONTROL",  "id":4522200, "ctx":"conn12260","msg":"Writing to log file failed, aborting appl>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]: BACKTRACE: {"backtrace":[{"a":"AAAAB0BD15A4","b":"AAAAAD410000","o":"37C15A4"},{"a":"AAAAB0BD3E80","b":"AAAAAD410000","o":"37C3E80","s":"_ZN5mongo15print>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0BD15A4","b":"AAAAAD410000","o":"37C15A4"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0BD3E80","b":"AAAAAD410000","o":"37C3E80","s":"_ZN5mongo15printStackTraceERSo","s+":"40"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0B8F4EC","b":"AAAAAD410000","o":"377F4EC","s":"_ZN5mongo5logv214FileRotateSink7consumeERKN5boost3log12v2s_mt_posix11record_viewERKNSt>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0BAAEDC","b":"AAAAAD410000","o":"379AEDC","s":"_ZN5boost3log12v2s_mt_posix5sinks13unlocked_sinkIN5mongo5logv216CompositeBackendIJNS5_>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0CB1A1C","b":"AAAAAD410000","o":"38A1A1C","s":"_ZN5boost3log12v2s_mt_posix4core16push_record_moveERNS1_6recordE","s+":"164"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0B9C734","b":"AAAAAD410000","o":"378C734","s":"_ZN5mongo5logv26detail9doLogImplEiRKNS0_11LogSeverityERKNS0_10LogOptionsENS_10StringDa>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE57A000","b":"AAAAAD410000","o":"116A000"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE581C68","b":"AAAAAD410000","o":"1171C68","s":"_ZN5mongo9transport19ServiceStateMachine4Impl14cleanupSessionERKNS_6StatusE","s+":"F0"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE582024","b":"AAAAAD410000","o":"1172024","s":"_ZN5mongo9transport19ServiceStateMachine4Impl15scheduleNewLoopENS_6StatusE","s+":"2AC"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE582418","b":"AAAAAD410000","o":"1172418"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE5826AC","b":"AAAAAD410000","o":"11726AC","s":"_ZN5mongo9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusE","s+":"19C"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAE582BC8","b":"AAAAAD410000","o":"1172BC8"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB041FAA4","b":"AAAAAD410000","o":"300FAA4"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAAEBE5D94","b":"AAAAAD410000","o":"17D5D94","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport15ServiceExecutor8scheduleENS>
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB041FCC0","b":"AAAAAD410000","o":"300FCC0"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB0423810","b":"AAAAAD410000","o":"3013810"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"AAAAB04238C4","b":"AAAAAD410000","o":"30138C4"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"FFFF81D633F0","b":"FFFF81D5C000","o":"73F0"}
Jul 10 02:22:40 pp-agg-1638951636 mongod[373116]:   Frame: {"a":"FFFF81CBB0DC","b":"FFFF81BEB000","o":"D00DC"}
Jul 10 02:22:43 pp-agg-1638951636 systemd[1]: mongod.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 02:22:43 pp-agg-1638951636 systemd[1]: mongod.service: Failed with result 'exit-code'.
Jul 10 02:22:48 pp-agg-1638951636 systemd[1]: mongod.service: Scheduled restart job, restart counter is at 1.
Jul 10 02:22:48 pp-agg-1638951636 systemd[1]: Stopped MongoDB Database Server.
Jul 10 02:22:48 pp-agg-1638951636 systemd[1]: Started MongoDB Database Server.

The issue is that when I run `systemctl status mongod` it reports the service as active, and I can even see a `mongod` process, but I cannot access the server. The memory usage of the process is also very low (about 50 MB), and mongod did not write any startup_log entry. When I manually run `systemctl restart mongod`, it works.
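
To check whether the "active" mongod is actually serving, a quick sketch (assuming the default port 27017 and the legacy mongo shell that ships with 4.4/5.0):

# Is anything listening on the default MongoDB port?
sudo ss -tlnp | grep 27017
# Does the server answer a ping?
mongo --eval 'db.runCommand({ ping: 1 })'
# Did the kernel OOM killer fire around the crash?
sudo dmesg -T | grep -iE 'out of memory|oom'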

My guess is that the auto-restart happens too soon after the crash, before the memory has been released. Even so, `systemctl status mongod` should not report the service as active.
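
If that theory holds, one possible mitigation is to give the kernel more time before mongod comes back. A minimal sketch of an adjusted drop-in, with purely illustrative numbers (not a tested recommendation):

# /etc/systemd/system/mongod.service.d/always_restart.conf
[Service]
Restart=always
# Wait longer before systemd schedules the restart...
RestartSec=120
# ...and pause again at start-up so memory can be reclaimed first.
ExecStartPre=/bin/sleep 30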



 Comments   
Comment by Edwin Zhou [ 26/Oct/22 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Edwin Zhou [ 04/Oct/22 ]

Hi moroine.bentefrit@gmail.com,

We're still interested in this issue but need additional information to diagnose the problem. If you hit this problem again, would you please archive (tar or zip) and upload to the secure upload portal:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
  • the system messages and dmesg log files stored in /var/log/ covering the OOM incident
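
For example, assuming a default Ubuntu layout (dbPath /var/lib/mongodb, logs under /var/log/mongodb; the archive name is just a placeholder):

tar czf server-67936-diagnostics.tar.gz \
    /var/log/mongodb/mongod.log* \
    /var/lib/mongodb/diagnostic.data \
    /var/log/syslog* /var/log/kern.log*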

Best,
Edwin

Comment by Edwin Zhou [ 16/Sep/22 ]

Hi moroine.bentefrit@gmail.com,

If I'm understanding correctly, after the mongod was killed due to OOM, the process does not come back in a usable state unless you manually intervene with `systemctl restart mongod`. This occurs on MongoDB v5.0.6, not v4.4 as originally reported in the description.

It is unusual for the mongod to not fully restart after hitting OOM. However, it also seems that this problem is not reliably reproducible. Without diagnostic data and logs from the mongod, we will find it difficult to diagnose this issue.

If you hit this problem again, would you please archive (tar or zip) and upload to the secure upload portal:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
  • the system messages and dmesg log files stored in /var/log/ covering the OOM incident

Best,
Edwin

Comment by moroine bentefrit [ 31/Aug/22 ]

No, I mean the bug occurred under `5.0.6`, not `4.4`.

The bug still persists, in my opinion; I had only one scenario that made MongoDB crash due to OOM. I tried to reproduce it, but I'm not able to.

Comment by Edwin Zhou [ 30/Aug/22 ]

Hi moroine.bentefrit@gmail.com

Thank you for the follow-up! It sounds like you're no longer seeing this problem after upgrading from 4.4 to 5.0.6. Did you make any changes to your NodeJS application to prevent it from using too much memory, which had resulted in the OOM that killed MongoDB?

Comment by moroine bentefrit [ 30/Aug/22 ]

Thanks for checking. Unfortunately, I don't have the logs anymore, as it was on a pre-production server.

I forgot that we had updated MongoDB, so the version was 5.0.6, not 4.4.

I tried to reproduce the issue, but unfortunately I'm not able to.

  • When I try to use all the RAM, the server becomes unresponsive and doesn't restart.
  • When I add a `MemoryLimit` to the systemd unit, MongoDB is killed and restarts without any error (see the drop-in sketch below).

On my server, I have a NodeJS application that uses MongoDB. I cannot reproduce the issue anymore, but it looks like it was caused by the NodeJS process taking too much RAM. The OOM killer then killed both NodeJS and MongoDB, but MongoDB failed to restart, as mentioned in the bug description.
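
For reference, the `MemoryLimit` test above presumably used a drop-in along these lines (the file name and the 1G cap are illustrative; on cgroup v2 systems the current name for this directive is `MemoryMax=`):

# /etc/systemd/system/mongod.service.d/memory_limit.conf
[Service]
MemoryLimit=1G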

Comment by Edwin Zhou [ 29/Aug/22 ]

Hi moroine.bentefrit@gmail.com,

Thank you for your patience while I investigate this issue. Could you please provide some additional diagnostic data covering this behavior?

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Best,
Edwin
