[SERVER-10494] Deadlock holding global write lock when mongod started without --fork Created: 12/Aug/13 Updated: 13/Aug/13 Resolved: 13/Aug/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Amazon Linux 64-bit, Jenkins CI slaves. |
||
| Attachments: |
|
| Operating System: | Linux |
| Steps To Reproduce: | Start a 3-node replica set with mongod version 2.5.1 using our rs_manager tool. rs_manager starts mongods as subprocesses, calls replSetInitiate, and then exits. If you pass --fork to the subprocesses, the hang does not occur. Run PyMongo's unittest suite against the replica set. The Python and PyMongo versions don't matter. |
| Participants: |
| Description |
|
Running the PyMongo unittest suite leads to a hang, with some operation holding the global write lock. The server responds to currentOp and rs.status, but any operation requiring a lock hangs indefinitely. The point in the suite where we reach this hang is unpredictable, but it is most common in tests with tens or hundreds of Python threads. Server 2.5.0 never hangs; 2.5.1 almost always hangs at some point during PyMongo's test run. A successful run lasts about 3 minutes and executes about 515 tests. Interestingly, 2.5.1 doesn't hang when started with --fork. Here are example ops holding the write lock at the point where 2.5.1 hangs. In one test run:
In another:
Backtraces from the latter run: https://gist.github.com/ajdavis/275df7f2967ba63bb9ea |
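A minimal diagnostic sketch of the symptom described above (not from the ticket): a command that doesn't take the global lock still answers, while a write never completes. It assumes a mongod listening on localhost:27017 and uses the modern PyMongo API; the socket timeout merely turns the indefinite hang into a visible error.

```python
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("localhost", 27017, socketTimeoutMS=5000)

# ismaster is answered without acquiring the global write lock,
# so it succeeds even while the lock is stuck.
print(client.admin.command("ismaster"))

try:
    # Any operation that needs the write lock never completes; the
    # socket timeout surfaces the hang as an exception instead.
    client.test.probe.insert_one({"x": 1})
except PyMongoError as exc:
    print("write blocked:", exc)
```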
| Comments |
| Comment by A. Jesse Jiryu Davis [ 13/Aug/13 ] | |||
|
PyMongo tests start mongod with --logpath and -vvvvv, which mongod interprets to mean "write more info to the log, and also write some more info to stdout". For example, if the log level is >= 2, printMemInfo periodically writes to stdout. If stdout is a pipe from which no one reads, its buffer will eventually fill, blocking any threads that write to it. Once a thread holding the global lock blocks writing to stdout, the server is completely locked. Evidently 2.5.1 writes more to stdout than 2.5.0 when verbosity is high, probably a consequence of changes for | |||
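A minimal sketch of the mechanism described above, independent of MongoDB: a Linux pipe buffer holds a fixed amount of data (commonly 64 KB), and once it is full any further write blocks until a reader drains it. Python 3 is used here for brevity (the ticket's era was Python 2, but the kernel behavior is the same); non-blocking mode lets the demo report the capacity instead of hanging.

```python
import os

r, w = os.pipe()
os.set_blocking(w, False)  # avoid hanging the demo itself

written = 0
try:
    while True:
        # Keep writing until the kernel's pipe buffer refuses more data.
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass

print("pipe accepted %d bytes before a writer would block" % written)
os.close(r)
os.close(w)
```

A blocking writer in the same situation simply stops inside write(); if that writer is a mongod thread holding the global lock, every other lock-taking operation stalls behind it.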
| Comment by A. Jesse Jiryu Davis [ 13/Aug/13 ] | |||
|
Not a server bug, just a bug in my code. PyMongo's tests had worked by accident before this. | |||
| Comment by J Rassi [ 12/Aug/13 ] | |||
|
I'd guess that rs_manager calls Popen.wait()? See the excerpt from http://docs.python.org/2/library/subprocess.html#subprocess.PIPE.
It looks like using communicate() instead of wait() will make Python consume from standard out, which should fix the problem.
That'd be my guess, too. | |||
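A small sketch of the suggestion above (not PyMongo's actual code): with stdout=PIPE, communicate() keeps reading the pipe until the child exits, so the child can never block on a full buffer, whereas wait() alone leaves the pipe unread. The chatty child below is a stand-in for a verbose mongod.

```python
import subprocess
import sys

# Child prints more than a Linux pipe buffer (~64 KB) can hold.
chatty = [sys.executable, "-c", "print('x' * 200000)"]

proc = subprocess.Popen(chatty, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)

# proc.wait() here could deadlock: the child blocks writing once the pipe
# fills, and the parent blocks waiting for an exit that never comes.
out, _ = proc.communicate()  # drains stdout, then reaps the child
print("child wrote %d bytes and exited %d" % (len(out), proc.returncode))
```

For a long-lived daemon like mongod, communicate() would block the parent until the child exits, so redirecting stdout to a log file or draining the pipe on a background thread is the more practical variant of the same fix.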
| Comment by A. Jesse Jiryu Davis [ 12/Aug/13 ] | |||
|
rs_manager does:
Where cmd is the mongod command line and Popen is Python's standard-library subprocess.Popen. Python then makes a pipe: http://hg.python.org/cpython/file/ab550dac6209/Modules/_posixsubprocess.c#l727 Then it forks, dupes the pipe's fds, and execs mongod: http://hg.python.org/cpython/file/ab550dac6209/Modules/_posixsubprocess.c#l378 So, yes, mongod is writing to a pipe that the parent is not reading from. I wonder why 2.5.1 hangs in this situation while 2.5.0 doesn't; it may simply log more and run out of buffer space. PyMongo's tests can easily be updated to avoid this if we decide it isn't a server bug. | |||
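A hypothetical reconstruction of the launch pattern described above (the paths, flags, and names are illustrative, not rs_manager's actual code): the parent creates a pipe for mongod's output, never reads it, and exits.

```python
import subprocess

# Illustrative mongod invocation; --logpath and -vvvvv match the ticket.
cmd = ["mongod", "--dbpath", "/data/rs0", "--replSet", "repl0",
       "--logpath", "/data/rs0/mongod.log", "-vvvvv"]

# stdout/stderr become a pipe whose read end the parent never services...
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT)

# ...and once the parent exits, nothing ever will. When mongod later writes
# more than the pipe buffer can hold, the writing thread blocks; if that
# thread holds the global write lock, the whole server appears hung.
```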
| Comment by A. Jesse Jiryu Davis [ 12/Aug/13 ] | |||
|
rassi@10gen.com, I need to look into Python's subprocess module a little to tell you for sure, but I think that "written to a pipe no process is consuming from" is likely accurate. | |||
| Comment by A. Jesse Jiryu Davis [ 12/Aug/13 ] | |||
|
I've attached some files that show info about two consecutive repros of this issue, with the dbpath wiped and the replica set recreated between runs. | |||
| Comment by J Rassi [ 12/Aug/13 ] | |||
|
Where is standard out being written in the unit test suite? Thread 48 (the one holding the global lock) looks like it's blocked in an fflush(stdout) call, which I'd expect if the buffer is full (e.g., stdout is a pipe that no process is consuming from). |