[SERVER-9362] Mongod crashes at first getnonce call if process started without stdout and stderr file descriptors initialized. Created: 16/Apr/13  Updated: 21/Jul/15  Resolved: 11/Nov/13

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.4.1
Fix Version/s: 2.5.4

Type: Bug Priority: Minor - P4
Reporter: Tyler Brock Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

EC2, Amazon Linux AMI release 2013.03, MongoDB 2.4.1 or MongoDB master branch


Issue Links:
Related
is related to SERVER-9014 Mongod and mongos crash induced by ma... Closed
Operating System: Linux
Participants:

 Description   

When initiating the replica set from the driver one of the three members of the set will always crash with the following stack trace (built with --d):

# Stack trace from master as of today
Tue Apr 16 22:34:16.644 Backtrace:
0x10f29db 0xc21a20 0x7f0563f649c0 0x7f0563f64945 0x7f0563f6625b 0xfec39c 0xd45867 0xd456f7 0xd98639 0xd9936a 0xd99ce1 0xebafce 0xebf988 0xe6a26f 0xe6b19c 0xc29e81 0x10e312f 0x7f0564c6cc6b 0x7f05640125ed 
 /home/tyler/mongo/mongod(_ZN5mongo15printStackTraceERSo+0x27) [0x10f29db]
 /home/tyler/mongo/mongod(_ZN5mongo10abruptQuitEi+0x110) [0xc21a20]
 /lib64/libc.so.6(+0x349c0) [0x7f0563f649c0]
 /lib64/libc.so.6(gsignal+0x35) [0x7f0563f64945]
 /lib64/libc.so.6(abort+0x17b) [0x7f0563f6625b]
 /home/tyler/mongo/mongod(_ZN5mongo23InputStreamSecureRandom9nextInt64Ev+0x56) [0xfec39c]
 /home/tyler/mongo/mongod(_ZN5mongo11CmdGetNonce12getNextNonceEv+0x43) [0xd45867]
 /home/tyler/mongo/mongod(_ZN5mongo11CmdGetNonce3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x4f) [0xd456f7]
 /home/tyler/mongo/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x8c) [0xd98639]
 /home/tyler/mongo/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xacc) [0xd9936a]
 /home/tyler/mongo/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x396) [0xd99ce1]
 /home/tyler/mongo/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x46) [0xebafce]
 /home/tyler/mongo/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x392) [0xebf988]
 /home/tyler/mongo/mongod() [0xe6a26f]
 /home/tyler/mongo/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x2e8) [0xe6b19c]
 /home/tyler/mongo/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xe3) [0xc29e81]
 /home/tyler/mongo/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4de) [0x10e312f]
 /lib64/libpthread.so.0(+0x7c6b) [0x7f0564c6cc6b]
 /lib64/libc.so.6(clone+0x6d) [0x7f05640125ed]

This does NOT happen locally on Mac OS X when using the binary or building from source but does happen using on Amazon Linux on EC2 using master or a binary.

I tried to reproduce on a single mongod by writing a program which runs the {{

{getnonce: 1}

}} many times but I was unable to crash mongod in that way.

Running addr2line on the stack trace built with --d yields line 130 in src/mongo/platform/random.cpp as the source:

int64_t nextInt64() {  
  int64_t r;                                                                                                                 
  _in->read( reinterpret_cast<char*>( &r ), sizeof(r) );    
  if ( _in->fail() ) { 
    abort();
  }
  return r;
}



 Comments   
Comment by Githook User [ 21/Jul/15 ]

Author:

{u'username': u'dagolden', u'name': u'David Golden', u'email': u'dagolden@cpan.org'}

Message: devel: open handles to /dev/null rather than closing

On Linux, old mongod/mongos processes can crash when STDOUT, etc.
is closed. This fixes the harness to reopen them to /dev/null instead.

See SERVER-9362 for more details.
Branch: master
https://github.com/mongodb/mongo-perl-driver/commit/7a8ff77a1df6d1b2043fcea3a28b6047b1a49210

Comment by Andy Schwerin [ 11/Nov/13 ]

Resolved by the log framework rewrite, SERVER-10084.

Comment by auto [ 11/Jun/13 ]

Author:

{u'username': u'TylerBrock', u'name': u'Tyler Brock', u'email': u'tyler.brock@gmail.com'}

Message: RUBY-583: add ability to run clusters with auth

This previously was failing as the ruby Process#spawn method was closing
the STDOUT file descriptor. Mongod as of (2.4.2) expects for STDIN and
STDOUT to be the first two file descriptors as the logging system
subsequently tries to reopen and reroute those hardcoded FD values.

You can read more about the issue here:

https://jira.mongodb.org/browse/SERVER-9362
Branch: 1.x-stable
https://github.com/mongodb/mongo-ruby-driver/commit/3983469a7d75d1658faed7d4320bd0b7dc4c9da7

Comment by auto [ 23/Apr/13 ]

Author:

{u'date': u'2013-04-20T22:58:20Z', u'name': u'Tyler Brock', u'email': u'tyler.brock@gmail.com'}

Message: RUBY-583: add ability to run clusters with auth

This previously was failing as the ruby Process#spawn method was closing
the STDOUT file descriptor. Mongod as of (2.4.2) expects for STDIN and
STDOUT to be the first two file descriptors as the logging system
subsequently tries to reopen and reroute those hardcoded FD values.

You can read more about the issue here:

https://jira.mongodb.org/browse/SERVER-9362
Branch: 1.x-stable
https://github.com/mongodb/mongo-ruby-driver/commit/9e9de1e2c9e66e0c69d445d1daa036edfd878dae

Comment by Tad Marshall [ 20/Apr/13 ]

Excellent writeup Andy, thanks!

I wonder if an additional change we could make in the future would be to stop using dup2() to map stdout and stderr to the log file. Stack traces are currently sent to stdout on most platforms, which interferes with sending them to syslog; other than that, there may be no really valid cases of using stdout directly (and not by way of Logstream).

Removing static initializers whenever possible would also be helpful.

Comment by Andy Schwerin [ 19/Apr/13 ]

Labeling minor because the work around is to not close fd1 and fd2 before forking. User can instead map the two fds to /dev/null.

Comment by Andy Schwerin [ 19/Apr/13 ]

This is an initialization-order bug. After an hour or two pair-debugging with tyler@10gen.com, we determined the following.

  1. The ruby test program in question spawns mongod with file descriptors 1 and 2 both closed, and passing --logpath.
  2. mongod starts, and very early on (static initialization time) instantiates CmdGetNonce, which indirectly opens and stores a file descriptor for reading /dev/urandom. Because descriptor 1 is free, and open takes the lowest available descriptor (as required by its API), descriptor 1 is allocated for this purpose.
  3. The logging system is initialized, and uses dup2 to remap file descriptor 1 to be the log file, so std::cout goes to the log file.
  4. A getnonce command arrives and tries to read file descriptor 1, thinking it's /dev/urandom, opened for read. But, thanks to the logging system, it's the log file, opened for write. This puts the descriptor in a bad state, and mongod aborts.

The workaround is to never start mongod with file descriptors 1 and 2 closed (have to do both, because logging dup2s onto 2/stderr, also).

A solution to make mongod more robust is to fix the order of initialization so no files get opened before the logging system gets initialized.

Comment by Tyler Brock [ 17/Apr/13 ]

I'm using Amazon Linux on EC2 and assume you used ubuntu (maybe locally) I'll try to repro in the shell.

Comment by Spencer Brody (Inactive) [ 17/Apr/13 ]

Can you describe exactly how you're setting up the replica set? I just tried to reproduce by running:

conf = {_id:"myset", version:1, members: [{_id:0, host:'ubuntu:11111'}, {_id:1, host:'ubuntu:22222'}, {_id:2, host:'ubuntu:33333'}]}
rs.initiate(conf)

and it worked fine.

Comment by Tyler Brock [ 17/Apr/13 ]

Yes, I've reproduced on 2.4.2 with the concurrency issue mentioned in SERVER-9014 fixed and it happens every time (on a random member of the set)

Comment by Spencer Brody (Inactive) [ 17/Apr/13 ]

Can you reproduce this consistently? If so, can you try on 2.4.2 which has the getnonce concurrency issue fixed?

Comment by Tyler Brock [ 16/Apr/13 ]

Linked ticket is not the same but similar.

Generated at Thu Feb 08 03:20:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.