[SERVER-7434] Startup race with --fork Created: 21/Oct/12 Updated: 11/Jul/16 Resolved: 01/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.2.0, 2.2.2, 2.3.2 |
| Fix Version/s: | 2.2.3, 2.4.0-rc1 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Martin Buechler | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | All supported OSes except Windows. |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Installed the 2.2.0 RPM package from the 10gen repo. 'service mongod start' creates 3 processes:

root 12603 9583 0 12:59 pts/0 00:00:00 /bin/sh /sbin/service mongod restart

strace of PID 12648, the third (obviously hanging) process, gives:

) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)

gdb:

Thread 2 (Thread 0x40a87940 (LWP 12649)):
Thread 1 (Thread 0x2b5613d478c0 (LWP 12648)):

This behaviour is somewhat random, because sometimes the startup works.

Notes: I rebuilt mongod from source r2.2.0 and stripped the binary manually. To my surprise, this binary does not show this behaviour. Alas, another binary installed with 'scons install' always hangs. |
| Comments |
| Comment by Andy Schwerin [ 11/Mar/13 ] |
|
This issue probably has gperftools issue-496, resolved by r196, as its root cause. See the link below.

https://code.google.com/p/gperftools/source/detail?r=196

The fixes applied to the mongodb source to resolve |
| Comment by auto [ 01/Feb/13 ] |
|
Author: Andy Schwerin <schwerin@10gen.com>
Date: 2013-01-22T23:19:22Z
Message: |
| Comment by auto [ 23/Jan/13 ] |
|
Author: Andy Schwerin <schwerin@10gen.com>
Date: 2013-01-22T19:29:18Z
Message: |
| Comment by Andy Schwerin [ 22/Jan/13 ] |
|
I've confirmed the existence of a startup race when --fork is passed.

When using --fork, there are three processes of interest. The "child" is the final process left running mongod. The "parent" is the process that invokes it, and the "grandparent" is the process that invokes the parent. These three processes are all images of mongod.

The problem occurs because the grandparent starts a thread before calling fork, to handle terminal interrupts. That thread allocates heap memory as part of its startup. If the grandparent calls fork() from the main thread while the interrupt thread holds a malloc lock, the parent and child both execute in address spaces where the malloc lock is held but the holding thread does not exist, so their first attempt to take that lock blocks forever.

The easiest repro is to install the product from RPM on a RHEL system, and run the following shell scriptlet as root:
Then, just wait for it to hang, and use gdb -p to attach to the hung processes.

Because of the nature of the race, it is easier to trigger if you drive up the rate of context switches and drive down the number of available cores. numactl or running on a single-core VM helps with the latter, and a script like the following helps with the former:
|
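The mechanism described in the 22/Jan/13 comment above can be illustrated outside mongod. The following is a minimal sketch in Python, not the mongod code and not either of the scriptlets that comment refers to. It assumes a POSIX system, where fork() copies only the calling thread into the new process. An ordinary threading.Lock stands in for the malloc lock, a helper thread stands in for the interrupt thread, and the function name child_can_take_lock is hypothetical.

```python
import os
import threading
import warnings


def child_can_take_lock(hold_during_fork: bool) -> bool:
    """Fork while a helper thread may hold a lock; report whether the
    forked process manages to acquire its inherited copy of that lock."""
    lock = threading.Lock()        # stand-in for the allocator's internal lock
    ready = threading.Event()      # helper has reached its steady state
    release = threading.Event()    # tells the helper to finish

    def helper() -> None:
        # Stand-in for the interrupt-handling thread started before fork().
        if hold_during_fork:
            with lock:
                ready.set()
                release.wait()
        else:
            ready.set()
            release.wait()

    thread = threading.Thread(target=helper, daemon=True)
    thread.start()
    ready.wait()

    # Newer Python versions warn about exactly this pattern; silence the
    # warning because the hazard is what the sketch sets out to show.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Forked process: only the forking thread exists here, so nobody
        # can release a lock that was held at fork time. The timeout keeps
        # the sketch from hanging the way the real processes did.
        acquired = lock.acquire(timeout=0.5)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("lock free at fork, child acquires it:", child_can_take_lock(False))
    print("lock held at fork, child acquires it:", child_can_take_lock(True))
```

Under these assumptions, the forked process can take the lock only when no other thread held it at the moment of the fork. In the real bug there is no timeout, so the equivalent acquisition never returns, which is consistent with the hung processes reported here.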
| Comment by WangYu [ 20/Jan/13 ] |
|
I hit the same issue when using a Python script to start mongodb. According to ps, there are three processes and they hang. strace of the last process gives the result below. This does not happen every time, only randomly. Any suggestions on how to avoid this issue? Thanks!

-bash-4.1# ps -ef|grep mongo

) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
) = -1 ETIMEDOUT (Connection timed out)
^C <unfinished ...> |