[SERVER-8515] Unable to start DB with > 1024 files after upgrading from 2.2.x to 2.4.0-rc0 Created: 11/Feb/13  Updated: 11/Jul/16  Resolved: 12/Feb/13

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 2.4.0-rc0
Fix Version/s: 2.4.0-rc1

Type: Bug Priority: Blocker - P1
Reporter: Alvin Richards (Inactive) Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS-X


Issue Links:
Depends
depends on SERVER-8521 Lazily clean up temp collections Closed
Operating System: ALL
Participants:

 Description   

Problem:
With more than 1024 database files, 2.4.0-rc0 (and 2.3.2) fails to start, logging:

Mon Feb 11 21:43:47.460 [IndexRebuilder] opening db:  fred98
Mon Feb 11 21:43:47.463 [IndexRebuilder] opening db:  fred99
Mon Feb 11 21:43:47.465 [IndexRebuilder] opening db:  local
Mon Feb 11 21:43:47.468 [websvr] admin web console waiting for connections on port 28017
Mon Feb 11 21:43:47.468 [initandlisten] waiting for connections on port 27017
Mon Feb 11 21:43:47.469 [websvr] select() failure: ret=-1 errno:22 Invalid argument
Mon Feb 11 21:43:47.469 [initandlisten] select() failure: ret=-1 errno:22 Invalid argument
Mon Feb 11 21:43:47.469 [initandlisten] now exiting

Reproduce:

> mongod --dbpath /data/db/bug --logpath /data/db/bug/server.log --fork --smallfiles --noprealloc

Create just enough databases that startup still succeeds:

> mongo admin
for (i=0; i < 506; i++) { var dummyDb = db.getSisterDB( "fred" + i ).foo.insert({x:1}); }
db.shutdownServer()

Startup will be ok

> mongod --dbpath /data/db/bug --logpath /data/db/bug/server.log --fork --smallfiles --noprealloc

Add one more database and shut down:

> mongo admin
var dummyDb = db.getSisterDB( "fred507" ).foo.insert({x:1});
db.shutdownServer()

Startup will now fail

> mongod --dbpath /data/db/bug --logpath /data/db/bug/server.log --fork --smallfiles --noprealloc

Note:
If you start 2.2.3 against the same database files, it starts up OK.



 Comments   
Comment by Alvin Richards (Inactive) [ 16/Feb/13 ]

Tested successfully on OS-X with git version: 1bd8b84c64214356f482fa3164d88e664f585243

Comment by auto [ 13/Feb/13 ]

Author: Eric Milkie <milkie@10gen.com> (2013-02-13T14:15:45Z)

Message: SERVER-8515 ensure Windows can run with > 63 open fd's
Branch: master
https://github.com/mongodb/mongo/commit/849d5233b6e3900452a46aabbc09c0b77c1b30c2

Comment by Eric Milkie [ 12/Feb/13 ]

See SERVER-8521 for further commits to fix this issue.

Comment by auto [ 12/Feb/13 ]

Author: Eric Milkie <milkie@10gen.com> (2013-02-12T19:02:38Z)

Message: SERVER-8515 avoid memory corruption by checking max file descriptor limit
Branch: master
https://github.com/mongodb/mongo/commit/77ed8b1f3536574e9a72ab0361d1b0c414d5b899

Comment by Eric Milkie [ 12/Feb/13 ]

We will fix this by not opening all databases at startup; see SERVER-8521.

We should also prevent calling select() with fd's higher than FD_SETSIZE; that work will be done in this ticket.
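
A minimal sketch, in C, of the kind of guard described here. The helper names fd_fits_in_fd_set and build_read_set are hypothetical and are not taken from the MongoDB source or the linked commits; the sketch only illustrates refusing to place a descriptor in an fd_set unless it is below FD_SETSIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: true when fd can be stored in an fd_set without
 * touching memory outside its FD_SETSIZE-bit array. */
static bool fd_fits_in_fd_set(int fd) {
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical helper: fills *out with the given sockets and returns the
 * nfds argument for select() (highest fd + 1). Returns -1 if any socket
 * cannot be represented, so the caller can skip select() entirely. */
static int build_read_set(const int *socks, size_t count, fd_set *out) {
    int maxfd = -1;
    assert(out != NULL);
    FD_ZERO(out);
    for (size_t i = 0; i < count; ++i) {
        if (!fd_fits_in_fd_set(socks[i])) {
            return -1;  /* never call FD_SET with an out-of-range fd */
        }
        FD_SET(socks[i], out);
        if (socks[i] > maxfd) {
            maxfd = socks[i];
        }
    }
    return maxfd + 1;
}
```

A caller would check for -1 and report an error instead of calling select(); with the listening socket at fd 1025, as in this ticket, the guard would trip before any fd_set is modified.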

Comment by Eric Milkie [ 12/Feb/13 ]

I ran this in the debugger. We are passing 1026 (maxfd+1) as the first parameter to select(). The Darwin man page says:

COMPATIBILITY
     select() now returns with errno set to EINVAL when nfds is greater than FD_SETSIZE.  Use a smaller
     value for nfds or compile with -D_DARWIN_UNLIMITED_SELECT.

Since our listening socket gets assigned a number higher than 1024, I don't think we can use select() with a bit array of 1024 (32 int32's) to listen on it?
This was working fine in 2.2 because we open the listening socket before we open all the database files, so the socket gets a low fd number.
I'm surprised that this works on Linux. What does FD_SET do when you give it a number higher than 1024?
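
The arithmetic in this comment can be sketched with a toy model in C. It assumes an fd_set laid out as 1024 bits in 32-bit words, as described above; the real layout is platform-specific, and the MODEL_ constants and model_ functions are hypothetical names used only for illustration. With nfds = 1026 the descriptor is 1025, whose bit would live in word index 32, one past the last valid index 31. POSIX leaves FD_SET undefined for descriptors outside 0..FD_SETSIZE-1, so on a platform that does not check, this would be an out-of-bounds write.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the bit array described in the comment: 1024 bits stored
 * in 32-bit words. These are illustrative constants, not platform values. */
enum {
    MODEL_SETSIZE = 1024,
    MODEL_WORD_BITS = 32,
    MODEL_WORDS = MODEL_SETSIZE / MODEL_WORD_BITS /* 32 words */
};

/* Index of the word that would hold the bit for fd. */
static int model_word_index(int fd) {
    assert(fd >= 0);
    return fd / MODEL_WORD_BITS;
}

/* Position of the bit for fd inside its word. */
static int model_bit_index(int fd) {
    assert(fd >= 0);
    return fd % MODEL_WORD_BITS;
}

/* True when the bit for fd falls inside the modelled array. */
static bool model_in_bounds(int fd) {
    return fd >= 0 && model_word_index(fd) < MODEL_WORDS;
}
```

In this model, fd 1023 maps to word 31, bit 31 (the last valid slot), while fd 1025 maps to word 32, bit 1, which is outside the array.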

Comment by Eric Milkie [ 12/Feb/13 ]

I tried this on OS X and can reproduce the same behavior as Alvin. Possible OS X select() issue?

Comment by Alvin Richards (Inactive) [ 12/Feb/13 ]

With -vvvvv, I get the following on 2.4.0-rc0:

 
Tue Feb 12 13:33:27.718 [IndexRebuilder] mmf open /data/db/bug/local.ns
Tue Feb 12 13:33:27.718 [IndexRebuilder] mmf finishOpening 0x8f20d3000 /data/db/bug/local.ns len:16777216
Tue Feb 12 13:33:27.719 [IndexRebuilder] mmf open /data/db/bug/local.0
Tue Feb 12 13:33:27.719 [IndexRebuilder] mmf finishOpening 0x8f40d3000 /data/db/bug/local.0 len:16777216
Tue Feb 12 13:33:27.719 [IndexRebuilder] mmf close 
Tue Feb 12 13:33:27.720 [IndexRebuilder] query local.system.namespaces ntoreturn:0 ntoskip:0 nscanned:1 keyUpdates:0  nreturned:1 reslen:114 0ms
Tue Feb 12 13:33:27.720 [initandlisten] runQuery called local.$cmd { create: "startup_log", size: 10485760, capped: true }
Tue Feb 12 13:33:27.720 [initandlisten] run command local.$cmd { create: "startup_log", size: 10485760, capped: true }
Tue Feb 12 13:33:27.720 [initandlisten] create collection local.startup_log { create: "startup_log", size: 10485760, capped: true }
Tue Feb 12 13:33:27.720 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:75 0ms
Tue Feb 12 13:33:27.721 [initandlisten] insert local.startup_log ninserted:1 keyUpdates:0  0ms
Tue Feb 12 13:33:27.721 [initandlisten] fd limit hard:32768 soft:16384 max conn: 13107
Tue Feb 12 13:33:27.721 [websvr] fd limit hard:32768 soft:16384 max conn: 13107
Tue Feb 12 13:33:27.721 [websvr] admin web console waiting for connections on port 28017
Tue Feb 12 13:33:27.721 [initandlisten] waiting for connections on port 27017
Tue Feb 12 13:33:27.721 [initandlisten] select() failure: ret=-1 errno:22 Invalid argument
Tue Feb 12 13:33:27.721 [journal] journal WRITETODATAFILES 1
Tue Feb 12 13:33:27.721 [initandlisten] now exiting
Tue Feb 12 13:33:27.722 [journal] journal WRITETODATAFILES 2
Tue Feb 12 13:33:27.722 dbexit: 

Comment by Alvin Richards (Inactive) [ 12/Feb/13 ]

No stack trace, just these errors in the log above. Need me to try with more verbose logging?

Comment by Eliot Horowitz (Inactive) [ 12/Feb/13 ]

Just tried this and didn't get a crash.
Do you have a stack?

Comment by Alvin Richards (Inactive) [ 11/Feb/13 ]

Forgot to post the ulimits

 
vero:software$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited
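
The limits above allow 16384 open files, which is independent of the compile-time FD_SETSIZE that bounds select(). A minimal sketch in C comparing the two; soft_nofile_limit and limit_exceeds_fd_setsize are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Returns the soft RLIMIT_NOFILE value, i.e. what `ulimit -n` reports.
 * getrlimit() with a valid resource and pointer is not expected to fail,
 * so this sketch treats failure as a programming error. */
static rlim_t soft_nofile_limit(void) {
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_NOFILE, &rl);
    assert(rc == 0);
    (void)rc; /* silence unused warning when NDEBUG disables assert */
    return rl.rlim_cur;
}

/* True when the process may hold descriptors numbered at or above
 * FD_SETSIZE, which an fd_set cannot represent. */
static bool limit_exceeds_fd_setsize(rlim_t soft_limit) {
    return soft_limit > (rlim_t)FD_SETSIZE;
}
```

Calling limit_exceeds_fd_setsize(soft_nofile_limit()) on a system configured like the one above would return true whenever FD_SETSIZE is 1024, which is the situation in which a late-opened socket can receive a descriptor that select() cannot handle.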

Comment by Alvin Richards (Inactive) [ 11/Feb/13 ]

Looks like this was introduced between 2.3.0 (OK) and 2.3.1 (fails).
