[SERVER-16023] Open file limit causes shutdown of WT but mongod stull running Created: 15/Oct/14  Updated: 04/Dec/14  Resolved: 10/Nov/14

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 2.8.0-rc0

Type: Bug Priority: Major - P3
Reporter: Alvin Richards (Inactive) Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Problem:

On a long run, open file limit was reached, with this in the logs

2014-10-15T02:47:26.388-0400 I NETWORK  [conn19835] end connection 127.0.0.1:43859 (37 connections now open)
2014-10-15T02:47:35.350-0400 E STORAGE  WiredTiger (24) [1413355655:350274][981310047f7b8187f0000], lsm-worker: /home/ec2-user/db/collection-8096527837003006657-2064-000019.lsm: open: Too many open files
2014-10-15T02:47:35.350-0400 E STORAGE  WiredTiger (24) [1413355655:350431][981310047f7b8187f0000], lsm-worker: Failed doing LSM switch: Too many open files
2014-10-15T02:47:35.350-0400 E STORAGE  WiredTiger (0) [1413355655:350449][981310047f7b8187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.350-0400 E STORAGE  WiredTiger (-31804) [1413355655:350472][981310047f7b8187f0000], lsm-worker: Error in LSM worker thread 3: WT_PANIC: WiredTiger library panic
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (0) [1413355655:351496][98131005777b9187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (0) [1413355655:351543][98131005777b9187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (0) [1413355655:351560][98131005777b9187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (-31804) [1413355655:351570][98131005777b9187f0000], lsm-worker: Failed doing LSM switch: WT_PANIC: WiredTiger library panic
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (0) [1413355655:351580][98131005777b9187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.351-0400 E STORAGE  WiredTiger (-31804) [1413355655:351596][98131005777b9187f0000], lsm-worker: Error in LSM worker thread 2: WT_PANIC: WiredTiger library panic
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (0) [1413355655:352124][981310067f7b9187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (0) [1413355655:352166][981310067f7b9187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (0) [1413355655:352179][981310067f7b9187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (-31804) [1413355655:352188][981310067f7b9187f0000], lsm-worker: Failed doing LSM switch: WT_PANIC: WiredTiger library panic
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (0) [1413355655:352197][981310067f7b9187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:47:35.352-0400 E STORAGE  WiredTiger (-31804) [1413355655:352206][981310067f7b9187f0000], lsm-worker: Error in LSM worker thread 1: WT_PANIC: WiredTiger library panic
2014-10-15T02:48:08.150-0400 E STORAGE  WiredTiger (0) [1413355688:150467][98131009777bb187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:48:08.150-0400 E STORAGE  WiredTiger (0) [1413355688:150550][98131009777bb187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:48:08.150-0400 E STORAGE  WiredTiger (0) [1413355688:150564][98131009777bb187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart
2014-10-15T02:48:08.150-0400 E STORAGE  WiredTiger (-31804) [1413355688:150575][98131009777bb187f0000], sweep-server: handle sweep server error: WT_PANIC: WiredTiger library panic
2014-10-15T12:06:37.356-0400 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:43896 #19872 (38 connections now open)
2014-10-15T12:06:37.356-0400 I NETWORK  [conn19872] end connection 127.0.0.1:43896 (37 connections now open)
2014-10-15T12:06:37.591-0400 I NETWORK  [conn3] end connection 127.0.0.1:52133 (36 connections now open)

However, mongod left running.



 Comments   
Comment by Keith Bostic [ 10/Nov/14 ]

Yes, that looks correct to me.

Comment by Eliot Horowitz (Inactive) [ 10/Nov/14 ]

Keith - take a look at this: https://github.com/mongodb/mongo/commit/4ab029e3e68ea5f7aa89b90a47fd814d17b142bf
Assuming I did what you meant correctly, then yes, I think we're good to go.

Comment by Keith Bostic [ 10/Nov/14 ]

Mark, we made a minor change to pass WT_PANIC as the error to the error handler, which means applications can figure out if a panic has happened by testing the error value.

Is that sufficient for your needs? Or are there additional advantages to having a separate panic handler?

Comment by Keith Bostic [ 10/Nov/14 ]

We've pushed changes into the WiredTiger develop branch that panic the engine if a thread fails, causing all future calls to return failure, reference https://github.com/wiredtiger/wiredtiger/pull/1356.

Comment by Githook User [ 10/Nov/14 ]

Author:

{u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'}

Message: SERVER-16023: abort mongod on WT_PANIC
Branch: master
https://github.com/mongodb/mongo/commit/4ab029e3e68ea5f7aa89b90a47fd814d17b142bf

Comment by Keith Bostic [ 07/Nov/14 ]

> A handler_panic function is what we are looking for.

OK, let us think about that a little and get back to you.

> In this state, would you expect us to call storage engine shutdown, or would it be a problem if we just called _exit()?

It's safer not to call storage engine shutdown, something very, very bad has happened and continuing to run is risky at best.

Comment by Mark Benvenuto [ 07/Nov/14 ]

A handler_panic function is what we are looking for. Something like (mocked up from handle_error)

int(* WT_EVENT_HANDLER::handle_panic)(WT_EVENT_HANDLER *handler, WT_SESSION *session, int error, const char *message)

We just want the callback for logging purposes, and so that we can go through our own abrupt shutdown routines. We were not going to continue the process at this point.

In this state, would you expect us to call storage engine shutdown, or would it be a problem if we just called _exit()?

Comment by Keith Bostic [ 07/Nov/14 ]

Generally, there's no choice about what to do (for example, a panic might indicate corrupted memory, unexpected data format, or lack of resources); the only option is to exit and restart.

We could probably offer a special "we're going to panic" notification that would allow the application to immediately throw up its hands, but once we decide to panic, no future calls to the engine will succeed, they will all immediately return WT_PANIC.

Comment by Eric Milkie [ 07/Nov/14 ]

Could we have a special fatal or severe message callback function, in which we could abort the process? Then you could allow engine consumers the choice of what to do when such things happen.

Comment by Keith Bostic [ 07/Nov/14 ]

I suspect that for now, the right approach for us is to panic the engine if a thread fails so that all future calls return failure. It sounds to me like that is acceptable to you?

Working threads that see resource failure won't fail the engine (for example, an attempt to open a new table might fail), this is only the backing server threads.

Does that make sense?

Comment by Mark Benvenuto [ 07/Nov/14 ]

A simple repro

ulimit -n 100
./mongod --storageEngine=wiredtiger

In mongo shell, run the following command

for(var i = 0; i < 200; i++ ) { db.createCollection("coll" + i); }

Here are the errors I see in the log:

2014-11-07T15:11:18.491-0500 I QUERY    [conn1] command test.$cmd command: create { create: "coll42" } keyUpdates:0 numYields:0  reslen:94 5040ms
2014-11-07T15:11:18.851-0500 E STORAGE  WiredTiger (24) [1415391078:851854][1951:0x7fab72791700], archive-server: /tmp/foo/journal: opendir: Too many open files
2014-11-07T15:11:18.851-0500 E STORAGE  WiredTiger (24) [1415391078:851979][1951:0x7fab72791700], archive-server: dirlist journal prefix WiredTigerLog: Too many open files
2014-11-07T15:11:18.852-0500 E STORAGE  WiredTiger (24) [1415391078:852026][1951:0x7fab72791700], archive-server: log archive server error: Too many open files
2014-11-07T15:11:23.497-0500 E STORAGE  [conn1] WiredTiger (24) [1415391083:497585][1951:0x7fab7df47700], session.create: /tmp/foo/collection-85-2268107986619751322.wt: open: Too many open files
2014-11-07T15:11:23.497-0500 I -        [conn1] Assertion: 8:24: Too many open files
2014-11-07T15:11:23.532-0500 I CONTROL  [conn1]

There are errors in both the archive thread, and the thread allocating the table that backs the collection.

Generated at Thu Feb 08 03:39:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.