[SERVER-9362] Mongod crashes at first getnonce call if process started without stdout and stderr file descriptors initialized. Created: 16/Apr/13 Updated: 21/Jul/15 Resolved: 11/Nov/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | 2.4.1 |
| Fix Version/s: | 2.5.4 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Tyler Brock | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
EC2, Amazon Linux AMI release 2013.03, MongoDB 2.4.1 or MongoDB master branch |
||
| Issue Links: |
|
||||||||
| Operating System: | Linux | ||||||||
| Participants: | |||||||||
| Description |
|
When initiating the replica set from the driver, one of the three members of the set will always crash with the following stack trace (built with --d):
This does NOT happen locally on Mac OS X, whether using the binary or building from source, but it does happen on Amazon Linux on EC2 using either master or a binary. I tried to reproduce on a single mongod by writing a program that runs the {getnonce: 1} command many times, but I was unable to crash mongod that way. Running addr2line on the stack trace (built with --d) yields line 130 in src/mongo/platform/random.cpp as the source:
|
| Comments |
| Comment by Githook User [ 21/Jul/15 ] | ||
|
Author: David Golden (dagolden, dagolden@cpan.org) Message: devel: open handles to /dev/null rather than closing On Linux, old mongod/mongos processes can crash when STDOUT, etc. See | ||
| Comment by Andy Schwerin [ 11/Nov/13 ] | ||
|
Resolved by the log framework rewrite, | ||
| Comment by auto [ 11/Jun/13 ] | ||
|
Author: Tyler Brock (TylerBrock, tyler.brock@gmail.com) Message: This previously was failing as the ruby Process#spawn method was closing You can read more about the issue here: https://jira.mongodb.org/browse/SERVER-9362 | ||
| Comment by auto [ 23/Apr/13 ] | ||
|
Author: Tyler Brock (tyler.brock@gmail.com), 2013-04-20T22:58:20Z Message: This previously was failing as the ruby Process#spawn method was closing You can read more about the issue here: https://jira.mongodb.org/browse/SERVER-9362 | ||
| Comment by Tad Marshall [ 20/Apr/13 ] | ||
|
Excellent writeup Andy, thanks! I wonder if an additional change we could make in the future would be to stop using dup2() to map stdout and stderr to the log file. Stack traces are currently sent to stdout on most platforms, which interferes with sending them to syslog; other than that, there may be no really valid cases of using stdout directly (and not by way of Logstream). Removing static initializers whenever possible would also be helpful. | ||
| Comment by Andy Schwerin [ 19/Apr/13 ] | ||
|
Labeling this minor because the workaround is simply not to close fd 1 and fd 2 before forking. Users can instead map the two fds to /dev/null. | ||
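A minimal sketch of that workaround from a launcher's point of view, written in Python for illustration (the drivers mentioned in this ticket use other languages). The helper name `spawn_with_devnull` is hypothetical; the point is only that the child starts with fds 1 and 2 open on /dev/null instead of closed.

```python
import subprocess
import sys


def spawn_with_devnull(argv):
    """Start a child whose stdout (fd 1) and stderr (fd 2) point at /dev/null.

    Hypothetical helper: the descriptors stay open in the child, they just
    discard whatever is written to them.
    """
    return subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


if __name__ == "__main__":
    # Stand-in child: os.fstat raises if the descriptor is closed, so a
    # zero exit status means both fd 1 and fd 2 were open in the child.
    check = "import os; os.fstat(1); os.fstat(2)"
    proc = spawn_with_devnull([sys.executable, "-c", check])
    print("child exit status:", proc.wait())
```

In a real launcher the argument list would be the mongod command line, for example `["mongod", "--dbpath", "/data/db"]`, where the path is only a placeholder.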
| Comment by Andy Schwerin [ 19/Apr/13 ] | ||
|
This is an initialization-order bug. After an hour or two pair-debugging with tyler@10gen.com, we determined the following.
The workaround is to never start mongod with file descriptors 1 and 2 closed (both must stay open, because logging also dup2s onto fd 2/stderr). A more robust solution is to fix the order of initialization so that no files get opened before the logging system is initialized. | ||
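The detailed findings are not reproduced in this export, so the following is only a sketch of the general POSIX behaviour that makes closed standard descriptors hazardous: `open` returns the lowest-numbered free descriptor, so a file opened early in a process that started with fds 1 and 2 closed ends up on fd 1. The function name is hypothetical, and the check runs in a throwaway child process so the caller's own descriptors are untouched.

```python
import subprocess
import sys

# Child program: close stdout and stderr, open a file, and report the
# descriptor number it received through the exit status.
CHILD_SOURCE = (
    "import os\n"
    "os.close(1)\n"
    "os.close(2)\n"
    "fd = os.open(os.devnull, os.O_RDONLY)\n"
    "os._exit(fd)\n"
)


def lowest_fd_after_closing_stdio():
    """Return the fd that open() hands out once fds 1 and 2 are closed.

    Hypothetical helper. stdin is pinned to /dev/null so fd 0 is
    guaranteed to be in use inside the child.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    print("first open() returned fd", lowest_fd_after_closing_stdio())
```

If a later step then calls dup2 onto fd 1 or fd 2, as the logging setup described above does, whatever was opened there is silently replaced. This appears to be why opening files before logging is initialized is unsafe.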
| Comment by Tyler Brock [ 17/Apr/13 ] | ||
|
I'm using Amazon Linux on EC2 and assume you used Ubuntu (maybe locally). I'll try to repro in the shell. | ||
| Comment by Spencer Brody (Inactive) [ 17/Apr/13 ] | ||
|
Can you describe exactly how you're setting up the replica set? I just tried to reproduce by running:
and it worked fine. | ||
| Comment by Tyler Brock [ 17/Apr/13 ] | ||
|
Yes, I've reproduced on 2.4.2 with the concurrency issue mentioned in | ||
| Comment by Spencer Brody (Inactive) [ 17/Apr/13 ] | ||
|
Can you reproduce this consistently? If so, can you try on 2.4.2 which has the getnonce concurrency issue fixed? | ||
| Comment by Tyler Brock [ 16/Apr/13 ] | ||
|
Linked ticket is not the same but similar. |