[SERVER-24776] Unable to recover after fassert(): too many open files Created: 24/Jun/16  Updated: 14/Jul/16  Resolved: 24/Jun/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: zeerak waseem Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File collect.py     File linguistic_feats.py     Text File mongo.log    
Operating System: ALL
Steps To Reproduce:

1) Write to database, get 'too many files' exception
2) Restart daemon (see everything has gone)
3) Try to recover to new directory (/data/db1)
4) Restart daemon (everything still gone)
5) Restart machine
6) Be unable to restart daemon

Participants:

 Description   

Hey guys,

My databases and all collections just disappeared after a crash.
It looks something like https://jira.mongodb.org/browse/SERVER-19774.
It started off with the crash (could restart daemon but without databases and collections) and now I can no longer start the daemon. (I have about a week to finish running experiments for my thesis so any help on recover is much appreciated).

I've described the issue here: http://stackoverflow.com/questions/38004375/mongodb-databases-disappeared?noredirect=1#comment63461612_38004375

Please let me know whatever you need and I'll make sure you get it.

It should be noted last I checked one database was 5GB and the other was < 500 MB.



 Comments   
Comment by zeerak waseem [ 25/Jun/16 ]

Hey, so turns out you weren't quite right. Changing ulimit (250 -> 1000) was the step that did not allow me to restart the daemon afterwards.
Another thing is that moving the journaling allowed me tor restart it. Here's the error:

[/data] mongod
2016-06-25T04:41:26.988+0200 I CONTROL [initandlisten] MongoDB starting : pid=12684 port=27017 dbpath=/data/db 64-bit host=Z.local
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] db version v3.2.0
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] git version: 45d947729a0315accb6d4f15a6b06be6d9c19fe7
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] allocator: system
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] modules: none
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] build environment:
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] distarch: x86_64
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] target_arch: x86_64
2016-06-25T04:41:26.989+0200 I CONTROL [initandlisten] options: {}
2016-06-25T04:41:26.989+0200 I - [initandlisten] Detected data files in /data/db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-06-25T04:41:26.992+0200 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=9G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-06-25T04:41:27.273+0200 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-06-25T04:41:27.273+0200 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
2016-06-25T04:41:27.275+0200 I NETWORK [initandlisten] waiting for connections on port 27017
2016-06-25T04:41:29.498+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:61481 #1 (1 connection now open)
2016-06-25T04:41:50.725+0200 I NETWORK [conn1] end connection 127.0.0.1:61481 (0 connections now open)
^C2016-06-25T04:41:52.591+0200 I CONTROL [signalProcessingThread] got signal 2 (Interrupt: 2), will terminate after current cmd ends
2016-06-25T04:41:52.591+0200 I FTDC [signalProcessingThread] Shutting down full-time diagnostic data capture
2016-06-25T04:41:52.595+0200 I CONTROL [signalProcessingThread] now exiting
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] closing listening socket: 5
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] closing listening socket: 6
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] shutdown: going to flush diaglog...
2016-06-25T04:41:52.595+0200 I NETWORK [signalProcessingThread] shutdown: going to close sockets...
2016-06-25T04:41:52.596+0200 I STORAGE [signalProcessingThread] WiredTigerKVEngine shutting down
2016-06-25T04:41:52.809+0200 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2016-06-25T04:41:52.810+0200 I CONTROL [signalProcessingThread] dbexit: rc: 0

What fixed that was to move the journals out of the database path which allows me to restart the daemon. Moving it back and restarting the daemon allows me to start it, but it still does not produce any of the records I had previously put into it. Primarily saying this for completeness. I'll post it to the mailing list once I have time.

I'll try updating to latest patch, but not hoping for much there.

Comment by Ramon Fernandez Marina [ 24/Jun/16 ]

Please see the documentation for recommended ulimit settings, that should do the trick.

Comment by zeerak waseem [ 24/Jun/16 ]

Hey Ramon, thanks for the quick answer. I I'll post it to the user group as well but (since my experiments for my thesis are to be done in a couple of days I'll ask here too) I tried to increase the max number of files which was the change that made me unable to start the daemon as well.
Do you have an idea as to why that is? I can't see anything in the logs for that.

Comment by Ramon Fernandez Marina [ 24/Jun/16 ]

Adding the beginning of the log and the fatal assertion part for convenience:

2016-01-18T19:31:42.152+0100 I CONTROL  [initandlisten] MongoDB starting : pid=36789 port=27017 dbpath=/usr/local/var/mongodb 64-bit host=Z.local
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] db version v3.2.0
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] git version: 45d947729a0315accb6d4f15a6b06be6d9c19fe7
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] allocator: system
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] modules: none
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] build environment:
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten]     distarch: x86_64
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten]     target_arch: x86_64
2016-01-18T19:31:42.153+0100 I CONTROL  [initandlisten] options: { config: "/usr/local/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, storage: { dbPath: "/usr/local/var/mongodb" }, systemLog: { destination: "file", logAppend: true, path: "/usr/local/var/log/mongodb/mongo.log" } }
2016-01-18T19:31:42.154+0100 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=9G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
...
2016-06-24T01:17:51.074+0200 E STORAGE  [thread1] WiredTiger (24) [1466723871:73125][473:0x70000020a000], log-server: /usr/local/var/mongodb/journal: opendir: Too many open files
2016-06-24T01:17:51.074+0200 E STORAGE  [thread1] WiredTiger (24) [1466723871:74536][473:0x70000020a000], log-server: dirlist journal prefix WiredTigerPreplog: Too many open files
2016-06-24T01:17:51.074+0200 E STORAGE  [thread1] WiredTiger (24) [1466723871:74568][473:0x70000020a000], log-server: log pre-alloc server error: Too many open files
2016-06-24T01:17:51.074+0200 E STORAGE  [thread1] WiredTiger (24) [1466723871:74588][473:0x70000020a000], log-server: log server error: Too many open files
2016-06-24T01:17:51.078+0200 W FTDC     [ftdc] Uncaught exception in 'FileNotOpen Failed to open interim file /usr/local/var/mongodb/diagnostic.data/metrics.interim.temp' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
2016-06-24T01:18:14.124+0200 E STORAGE  [thread2] WiredTiger (24) [1466723894:124125][473:0x700000393000], file:WiredTiger.wt, WT_SESSION.checkpoint: WiredTiger.turtle: fopen: Too many open files
2016-06-24T01:18:14.127+0200 E STORAGE  [thread2] WiredTiger (24) [1466723894:126992][473:0x700000393000], checkpoint-server: checkpoint server error: Too many open files
2016-06-24T01:18:14.127+0200 E STORAGE  [thread2] WiredTiger (-31804) [1466723894:127084][473:0x700000393000], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-06-24T01:18:14.127+0200 I -        [thread2] Fatal Assertion 28558
2016-06-24T01:18:14.127+0200 I -        [thread2] 
 
***aborting after fassert() failure
 
 
2016-06-24T01:18:14.163+0200 F -        [thread2] Got signal: 6 (Abort trap: 6).

zeerakw, looks like mongod shut itself down because of insufficient resources, and the error every time you restart is the same, so you'll need to increase your limits.

Also I see you're using 3.2.0 – I'd recommend you always use the latest patch release in a given version (3.2.7 at the moment).

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by zeerak waseem [ 24/Jun/16 ]

Scripts that caused the conflict. Specifically look at ``write_all_user_tweets`` in linguistic_feats.py

Generated at Thu Feb 08 04:07:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.