[SERVER-53273] Mongo crashing after 15-20 minutes Created: 08/Dec/20  Updated: 08/Feb/21  Resolved: 08/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: François Dubois Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hello,

We use mongo in a Windows container. We use the image 4.2-windowsservercore-1809. 

It works in many cases, but we have a customer where mongo crashes after 15-20-30 minutes.  I saw the issue 33291 (https://jira.mongodb.org/browse/SERVER-33291), but doesn't seem to be our issue.

In log, we have nothing interesting. We receive valid log I think and we saw a reboot without any crash reason. What happen ?  No idea.  We ran the container with the env. variable MONGODB_SYSTEM_LOG_VERBOSITY=5, but nothing related to a crash. And after the crash, it can restart since there is an issue with the WiredTiger.lock file.

How could we get more information on the crash in logs ? 

Here is a few logs before it restart and the issue that we have after with the WiredTiger.lock file.  Thank you for your help.

2020-12-07T15:19:17.786+0100 I NETWORK [conn53] end connection 172.22.36.234:49349 (57 connections now open)
2020-12-07T15:19:17.787+0100 I NETWORK [conn54] end connection 172.22.36.234:49350 (56 connections now open)
2020-12-07T15:19:18.280+0100 I NETWORK [listener] connection accepted from 172.22.36.234:49355 #59 (57 connections now open)
2020-12-07T15:29:09.578+0100 I NETWORK [conn1] end connection 172.22.38.187:49413 (56 connections now open)
2020-12-07T15:29:09.582+0100 I NETWORK [listener] connection accepted from 172.22.38.187:49584 #60 (57 connections now open)
2020-12-07T15:29:09.583+0100 I NETWORK [conn60] received client metadata from 172.22.38.187:49584 conn60: \{ driver: { name: "mongo-rust-driver", version: "1.1.1" }, os: \{ type: "windows", architecture: "x86_64", version: "Windows Server 2016 10.0.17763" } 
2020-12-07T15:34:29.713+0100 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-12-07T15:34:30.470+0100 W ASIO [main] No TransportLayer configured during NetworkInterface startup
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] MongoDB starting : pid=9740 port=27017 dbpath=C:\data\db\ 64-bit host=42eb81c3b9db
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] db version v4.2.11
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] git version: ea38428f0c6742c7c2c7f677e73d79e17a2aab96
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] allocator: tcmalloc
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] modules: none
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] build environment:
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] distmod: 2012plus
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] distarch: x86_64
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] target_arch: x86_64
2020-12-07T15:34:30.632+0100 I CONTROL [initandlisten] options: \{ net: { bindIp: "*" } }
2020-12-07T15:34:30.753+0100 W STORAGE [initandlisten] Detected unclean shutdown - C:\data\db\mongod.lock is not empty.
2020-12-07T15:34:30.753+0100 I STORAGE [initandlisten] Detected data files in C:\data\db\ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2020-12-07T15:34:30.753+0100 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2020-12-07T15:34:30.753+0100 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=3583M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],
2020-12-07T15:34:30.755+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device
2020-12-07T15:34:30.755+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device
2020-12-07T15:34:30.755+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device
2020-12-07T15:34:30.755+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device
2020-12-07T15:34:30.756+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device
2020-12-07T15:34:30.756+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device Raw: [1607351670:754772][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device
2020-12-07T15:34:30.756+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:756443][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device Raw: [1607351670:756443][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device
2020-12-07T15:34:30.840+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:839774][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device Raw: [1607351670:839774][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device
2020-12-07T15:34:30.840+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:840774][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device Raw: [1607351670:840774][9740:140717493998176], wiredtiger_open: __win_file_lock, 241: C:\data\db\\WiredTiger.lock: handle-lock: LockFile: The process cannot access the file because another process has locked a portion of the file.\r\n: Resource device
2020-12-07T15:34:30.840+0100 E STORAGE [initandlisten] WiredTiger error (16) [1607351670:840774][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device Raw: [1607351670:840774][9740:140717493998176], wiredtiger_open: __conn_single, 1708: WiredTiger database is already being managed by another process: Resource device
2020-12-07T15:34:30.840+0100 W STORAGE [initandlisten] Failed to start up WiredTiger under any compatibility version.
2020-12-07T15:34:30.840+0100 F STORAGE [initandlisten] Reason: 16: Resource device
2020-12-07T15:34:30.841+0100 F - [initandlisten] Fatal Assertion 28595 at src\mongo\db\storage\wiredtiger\wiredtiger_kv_engine.cpp 925
2020-12-07T15:34:30.841+0100 F - [initandlisten] \n\n***aborting after fassert() failure\n\n
2020-12-07T15:34:40.278+0100 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'



 Comments   
Comment by Edwin Zhou [ 08/Feb/21 ]

Hi francois.dubois.x@gmail.com,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Edwin

Comment by Edwin Zhou [ 25/Jan/21 ]

Hi francois.dubois.x@gmail.com,

To help us diagnose this problem, could you help clarify a few things:

  1. After a Windows container restarts, the mongod will not automatically restart because of WiredTiger.lock. You end up needing to manually delete WiredTiger.lock and manually restart the mongod. Is that correct?
  2. Are you able to consistently reproduce this issue?
  3. What steps are taken leading up to the container restarting?

If you're able to reproduce this, would you be able to attach the $dbpath/diagnostic.data directory of those incidents?

Thanks,
Edwin

Comment by Edwin Zhou [ 11/Jan/21 ]

Hi francois.dubois.x@gmail.com,

Thanks for getting back to me. I'd like to clarify a few things about your issue:

  1. After a Windows container restarts, the mongod will not automatically restart because of WiredTiger.lock. You end up needing to manually delete WiredTiger.lock and manually restart the mongod. Is that correct?
  2. Are you able to consistently reproduce this issue?
  3. What steps are taken leading up to the container restarting?

If you're able to reproduce this, would you be able to attach the $dbpath/diagnostic.data directory of those incidents? We may be able to narrow down why the container is restarting.

Best,
Edwin

Comment by François Dubois [ 07/Jan/21 ]

Hi,

Sorry for the delay, I didn't receive an email for your first comment. We saw that issue on a customer setup so I don't have the information for now and I don't know if I will be able to get it. But we continued to investigate and found other things, maybe not related to mongo.

We start our container with the flag `--restart=on-failure` and I think the container is restarting after 15-20 minutes for I don't know what reason. And when it restarts, we have the issue with the WiredTiger.lock file. So the problem is more the fact that it can't restart correctly if the container restart because the file WiredTiger.lock will be an issue. You have to delete that file manually, that's bad.

We have deployed a "real" mongo server (not in a container) and we still had issue with our application, other containers restart for I don't know what reasons. So we thought the initial problem was a crash from mongo, but all containers was restarting in fact. 

Conclusion : After a Mongo container restart, the mongo server will never be back since the file WiredTiger.lock is there. If that issue was not there, we would restart all containers and everything would be back at least. Do you know why we have this issue only with Windows container ?  But I know that we have to understand the root cause on our side, why all containers restart, and that part is not related to mongo.

Thanks,

François

 

Comment by Edwin Zhou [ 05/Jan/21 ]

Hi francois.dubois.x@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please attach an archive of the diagnostic.data directory to this ticket?

Thanks,
Edwin

Comment by Edwin Zhou [ 16/Dec/20 ]

Hi francois.dubois.x@gmail.com,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?

Kind regards,
Edwin

Generated at Thu Feb 08 05:30:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.