Deadlock holding global write lock when mongod started without --fork


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker - P1
    • Affects Version/s: 2.5.1
    • Component/s: Concurrency
    • Environment:
      Amazon Linux 64-bit, Jenkins CI slaves.
    • Operating System: Linux
    • Description:

      Start a 3-node replica set with mongod version 2.5.1 using our rs_manager tool. rs_manager starts mongods as subprocesses, calls replSetInitiate, and then it exits. If you pass --fork to the subprocesses the hang does not occur.

      Run PyMongo's unittest suite against the replica set. The Python and PyMongo version don't matter.

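The setup above can be sketched roughly as follows. This is a hypothetical outline of what rs_manager does, not its actual code; the set name, ports, and dbpaths are illustrative:

```python
# Hypothetical sketch of rs_manager's behavior: build the argv for three
# mongod subprocesses (by default without --fork), then initiate the set.
def mongod_argv(port, dbpath, fork=False):
    """Build the command line for one replica-set member."""
    argv = ["mongod", "--port", str(port), "--dbpath", dbpath,
            "--replSet", "repl0"]
    if fork:
        # Passing --fork (which requires --logpath) avoids the hang on 2.5.1.
        argv += ["--fork", "--logpath", "%s/mongod.log" % dbpath]
    return argv

def initiate_config(ports, host="localhost"):
    """The config document passed to the replSetInitiate command."""
    return {"_id": "repl0",
            "members": [{"_id": i, "host": "%s:%d" % (host, p)}
                        for i, p in enumerate(ports)]}

# Usage (launches real mongods, so shown only in comments):
#   import subprocess
#   ports = [27017, 27018, 27019]
#   procs = [subprocess.Popen(mongod_argv(p, "/data/rs%d" % i))
#            for i, p in enumerate(ports)]
#   # then, with PyMongo:
#   # MongoClient(port=ports[0]).admin.command(
#   #     "replSetInitiate", initiate_config(ports))
```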

      Running the PyMongo unittest suite leads to a hang with some op holding the global write lock. The server responds to currentOp and rs.status(), but any operation requiring a lock hangs indefinitely.

      The point in the suite where we hit this hang is unpredictable, but it is most common in tests that run tens or hundreds of Python threads.

      Server 2.5.0 never hangs; 2.5.1 almost always hangs at some point in PyMongo's test run. A successful run lasts about 3 minutes and executes about 515 tests. Interestingly, 2.5.1 doesn't hang when started with --fork.

      Example ops holding the write lock at the point where 2.5.1 hangs. In one test run:

      		{
      			"opid" : 30145,
      			"active" : true,
      			"secs_running" : 265,
      			"op" : "query",
      			"ns" : "pymongo_test",
      			"query" : {
      				"create" : "test",
      				"capped" : true,
      				"size" : 1000
      			},
      			"client" : "127.0.0.1:57030",
      			"desc" : "conn1509",
      			"threadId" : "0x7f5f76c81700",
      			"connectionId" : 1509,
      			"locks" : {
      				"^" : "w",
      				"^pymongo_test" : "W"
      			},
      			"waitingForLock" : false,
      			"numYields" : 0,
      			"lockStats" : {
      				"timeLockedMicros" : {
      
      				},
      				"timeAcquiringMicros" : {
      					"r" : NumberLong(0),
      					"w" : NumberLong(3)
      				}
      			}
      		}
      

      In another:

      		{
      			"opid" : 35786,
      			"active" : true,
      			"secs_running" : 377,
      			"op" : "insert",
      			"ns" : "pymongo-pooling-tests.unique",
      			"insert" : {
       
      			},
      			"client" : "127.0.0.1:54746",
      			"desc" : "conn497",
      			"threadId" : "0x7f9612829700",
      			"connectionId" : 497,
      			"locks" : {
      				"^" : "w",
      				"^pymongo-pooling-tests" : "W"
      			},
      			"waitingForLock" : false,
      			"msg" : "index: (1/3) external sort",
      			"numYields" : 0,
      			"lockStats" : {
      				"timeLockedMicros" : {
       
      				},
      				"timeAcquiringMicros" : {
      					"r" : NumberLong(0),
      					"w" : NumberLong(3)
      				}
      			}
      		}
      
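To spot the blocking op, we filtered db.currentOp() output for active ops that hold a write lock rather than wait for one. A minimal helper of that kind (hypothetical, not part of PyMongo; it assumes the 2.5.x currentOp document shape shown above, where lock modes are "r"/"w"/"R"/"W"):

```python
def write_lock_holders(inprog):
    """Return ops from currentOp()['inprog'] that hold a write lock.

    An op qualifies if it is active, holds some lock in mode 'w' or 'W',
    and is not itself waiting for a lock.
    """
    holders = []
    for op in inprog:
        locks = op.get("locks", {})
        holds_write = any(mode in ("w", "W") for mode in locks.values())
        if op.get("active") and holds_write and not op.get("waitingForLock"):
            holders.append(op)
    return holders
```

Fed the inprog array from either run above, this flags the capped-collection create (opid 30145) and the index-building insert (opid 35786), since each holds "^": "w" plus a database-level "W" with waitingForLock false.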

      Backtraces from the latter run:

      https://gist.github.com/ajdavis/275df7f2967ba63bb9ea

        1. repro-0-backtrace.txt
          122 kB
          A. Jesse Jiryu Davis
        2. repro-0-currentOp.txt
          2 kB
          A. Jesse Jiryu Davis
        3. repro-0-pymongo-tests.txt
          28 kB
          A. Jesse Jiryu Davis
        4. repro-1-backtrace.txt
          35 kB
          A. Jesse Jiryu Davis
        5. repro-1-currentOp.txt
          2 kB
          A. Jesse Jiryu Davis
        6. repro-1-pymongo-tests.txt
          28 kB
          A. Jesse Jiryu Davis

            Assignee:
            Unassigned
            Reporter:
            A. Jesse Jiryu Davis
            Votes:
            0
            Watchers:
            5
