[SERVER-10582] oplog_all_ops.js fails on Assertion failure, mutex problem when locking ReplicaSetMonitor Created: 20/Aug/13  Updated: 11/Jul/16  Resolved: 22/Aug/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.2

Type: Bug Priority: Critical - P2
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MCI linux-64-debug-duroff


Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

I have successfully repro'd this failure on the Linux 64-bit debug build.

MCI says rev 07c221086ff5c2 fails and c735600b22a was good. Will bisect ASAP.

http://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_duroff_07c221086ff5c20d4d62ba69837d90404459c335_13_08_20_18_54_06_tool_linux_64#logs/task/true
http://buildlogs.mongodb.org/mci_0.9_linux_64_debug_duroff/builds/546/test/tool_0/oplog_all_ops.js

mutex problem
  when locking ReplicaSetMonitor
  ReplicaSetMonitorWatcher::_safego was already locked and should not be.
  set a and b above to debug.
2013-08-20T19:37:42.022+0000 [ReplicaSetMonitorWatcher]  Assertion failure 0 src/mongo/util/concurrency/mutexdebugger.h 134
2013-08-20T19:37:42.028+0000 [ReplicaSetMonitorWatcher]  0x9d5277 0x9bfc54 0x9a8bc8 0x8a988a 0x8a9ba0 0x9128b9 0x9201e7 0x9a9ee1 0x9accd2 0x9acbdd 0x9acb5f 0x9acaf2 0xa13266 0x7f7140339c6b 0x7f713f6df5ed 
 ./mongo(_ZN5mongo15printStackTraceERSo+0x27) [0x9d5277]
 ./mongo(_ZN5mongo10logContextEPKc+0x71) [0x9bfc54]
 ./mongo(_ZN5mongo12verifyFailedEPKcS1_j+0xff) [0x9a8bc8]
 ./mongo(_ZN5mongo13MutexDebugger8enteringEPKc+0x6e6) [0x8a988a]
 ./mongo(_ZN5mongo5mutex11scoped_lockC1ERS0_+0x74) [0x8a9ba0]
 ./mongo(_ZN5mongo17ReplicaSetMonitor8checkAllEv+0x45) [0x9128b9]
 ./mongo(_ZN5mongo24ReplicaSetMonitorWatcher3runEv+0xb1) [0x9201e7]
 ./mongo(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x2cf) [0x9a9ee1]
 ./mongo(_ZNK5boost4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS3_9JobStatusEEEEclEPS3_S6_+0x7a) [0x9accd2]
 ./mongo(_ZN5boost3_bi5list2INS0_5valueIPN5mongo13BackgroundJobEEENS2_INS_10shared_ptrINS4_9JobStatusEEEEEEclINS_4_mfi3mf1IvS4_S9_EENS0_5list0EEEvNS0_4typeIvEERT_RT0_i+0x77) [0x9acbdd]
 ./mongo(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS5_9JobStatusEEEEENS0_5list2INS0_5valueIPS5_EENSB_IS8_EEEEEclEv+0x39) [0x9acb5f]
 ./mongo(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x1e) [0x9acaf2]
 ./mongo() [0xa13266]
 /lib64/libpthread.so.0(+0x7c6b) [0x7f7140339c6b]
 /lib64/libc.so.6(clone+0x6d) [0x7f713f6df5ed]
2013-08-20T19:37:42.028+0000 [ReplicaSetMonitorWatcher] 
***aborting after verify() failure as this is a debug/test build
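
For context, the MutexDebugger message above fires when a mutex is acquired in an order that contradicts an ordering it has already recorded. A minimal sketch of that kind of inversion, using plain std::mutex and illustrative names (this is not MongoDB code):

// Minimal sketch (not MongoDB code) of the lock-order inversion a debug-build
// checker like MutexDebugger reports. Names are illustrative stand-ins.
#include <mutex>

std::mutex watcherMutex;  // stands in for ReplicaSetMonitorWatcher::_safego
std::mutex monitorMutex;  // stands in for the ReplicaSetMonitor mutex

// Path 1: take the monitor mutex first, then the watcher mutex.
void pathOne() {
    std::lock_guard<std::mutex> m(monitorMutex);
    std::lock_guard<std::mutex> w(watcherMutex);   // order recorded: monitor -> watcher
}

// Path 2: the watcher thread holds its own mutex and then calls into code that
// takes the monitor mutex. Given the order recorded above this can deadlock,
// so the checker asserts that watcherMutex "was already locked and should not
// be" at the point monitorMutex is being acquired.
void pathTwo() {
    std::lock_guard<std::mutex> w(watcherMutex);
    std::lock_guard<std::mutex> m(monitorMutex);   // inconsistent order: watcher -> monitor
}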



 Comments   
Comment by auto [ 21/Aug/13 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-10582 oplog_all_ops.js fails on Assertion failure, mutex problem when locking RSM

Make sure that BackgroundJob is running before calling wait.
Branch: master
https://github.com/mongodb/mongo/commit/3696ac227caa1dd04aa133ad202b8250ebd7c02d
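
The commit message describes waiting on the background job only after it has actually started. A minimal sketch of that idea, with hypothetical names and a condition variable rather than the actual BackgroundJob API:

// Hedged sketch (hypothetical names, not the MongoDB BackgroundJob API): the
// waiter first blocks until the job has been marked running, so wait() cannot
// race with a job whose thread has not started yet.
#include <condition_variable>
#include <mutex>
#include <thread>

class Job {
public:
    void go() {
        _thread = std::thread([this] {
            {
                std::lock_guard<std::mutex> lk(_m);
                _running = true;                    // mark started before doing work
            }
            _cv.notify_all();
            run();                                  // the actual work
            {
                std::lock_guard<std::mutex> lk(_m);
                _done = true;
            }
            _cv.notify_all();
        });
    }

    // Block until the job has at least started, then until it finishes.
    // Callers must have called go() first; that is the point of the fix.
    void wait() {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this] { return _running; });  // "make sure it is running"
        _cv.wait(lk, [this] { return _done; });
        lk.unlock();
        if (_thread.joinable()) _thread.join();
    }

private:
    virtual void run() {}                           // overridden by concrete jobs
    std::thread _thread;
    std::mutex _m;
    std::condition_variable _cv;
    bool _running = false;
    bool _done = false;
};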

Comment by auto [ 21/Aug/13 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-10582 oplog_all_ops.js fails on Assertion failure, mutex problem when locking RSM

Avoid the mutex lock ordering inconsistency by not holding the _monitorMutex when calling ReplicaSetMonitor::checkAll.
Branch: master
https://github.com/mongodb/mongo/commit/401b30bc474e51658268c9a27b7ca3f6ae70aa41
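
The commit message describes releasing _monitorMutex before calling ReplicaSetMonitor::checkAll. A minimal sketch of that copy-then-release pattern, with hypothetical types and names (not the actual MongoDB code):

// Hedged sketch of the approach in the commit message: snapshot the state that
// needs _monitorMutex, drop the lock, and only then call into code that takes
// its own mutex, so no fixed ordering between the two locks is ever created.
#include <mutex>
#include <set>
#include <string>

std::mutex _monitorMutex;              // protects the set of monitored set names
std::set<std::string> _monitoredSets;

// Assumed helper: would acquire the monitor's own mutex and refresh its view.
void checkReplicaSet(const std::string& name) { (void)name; }

void checkAllSets() {
    std::set<std::string> toCheck;
    {
        std::lock_guard<std::mutex> lk(_monitorMutex);
        toCheck = _monitoredSets;      // copy under the lock...
    }                                  // ...then release it before locking anything else
    for (const auto& name : toCheck) {
        checkReplicaSet(name);         // safe: _monitorMutex is no longer held here
    }
}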

Comment by Matt Kangas [ 20/Aug/13 ]

Verified that both of these tests now pass with Randolph's http://codereview.10gen.com/11462033/ applied:

jstests/tool/oplog_all_ops.js
jstests/sharding/shard_insert_getlasterror_w2.js

Comment by Matt Kangas [ 20/Aug/13 ]

FYI, `jstests/sharding/shard_insert_getlasterror_w2.js` is also failing. When trying to repro it, I saw the same assertion failure.

 m31006| 2013-08-20T18:02:27.410-0500 [Balancer] distributed lock 'balancer/buildmeister.10gen.cc:31006:1377039738:1804289383' acquired, ts : 5213f583d332d8ac92ec7218
 m31000| 2013-08-20T18:02:27.532-0500 [conn4] end connection 127.0.0.1:39003 (10 connections now open)
 m31000| 2013-08-20T18:02:27.532-0500 [initandlisten] connection accepted from 127.0.0.1:39064 #19 (12 connections now open)
 m31006| 2013-08-20T18:02:27.546-0500 [Balancer] distributed lock 'balancer/buildmeister.10gen.cc:31006:1377039738:1804289383' unlocked.
 m31000| 2013-08-20T18:02:29.745-0500 [conn9] end connection 127.0.0.1:39008 (10 connections now open)
 m31000| 2013-08-20T18:02:29.745-0500 [initandlisten] connection accepted from 127.0.0.1:39065 #20 (12 connections now open)
 m31006| mutex problem
 m31006|   when locking ReplicaSetMonitor
 m31006|   ReplicaSetMonitorWatcher::_safego was already locked and should not be.
 m31006|   set a and b above to debug.
 m31006|
 m31006| 2013-08-20T18:02:30.936-0500 [ReplicaSetMonitorWatcher]  Assertion failure 0 src/mongo/util/concurrency/mutexdebugger.h 134
 m31006| 2013-08-20T18:02:30.946-0500 [ReplicaSetMonitorWatcher]  0xec2809 0xea8ed7 0xe916f3 0xb54301 0xb545c0 0xb81cc6 0xb90540 0xe92c7a 0xe95b19 0xe95a17 0xe95995 0xe95922 0xf01ce6 0x3deda07851 0x3ded6e890d
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo15printStackTraceERSo+0x29) [0xec2809]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo10logContextEPKc+0x94) [0xea8ed7]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo12verifyFailedEPKcS1_j+0x119) [0xe916f3]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo13MutexDebugger8enteringEPKc+0x797) [0xb54301]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo5mutex11scoped_lockC1ERS0_+0x76) [0xb545c0]
Received signal 6
Backtrace: 0xec05d7 0xec069c 0x3ded632920 0x3ded6328a5 0x3ded634085 0xe9185b 0xb54301 0xb545c0 0xb81cc6 0xb90540 0xe92c7a 0xe95b19 0xe95a17 0xe95995 0xe95922 0xf01ce6 0x3deda07851 0x3ded6e890d
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo17ReplicaSetMonitor8checkAllEv+0x44) [0xb81cc6]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo24ReplicaSetMonitorWatcher3runEv+0xce) [0xb90540]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x39c) [0xe92c7a]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZNK5boost4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS3_9JobStatusEEEEclEPS3_S6_+0x7d) [0xe95b19]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5boost3_bi5list2INS0_5valueIPN5mongo13BackgroundJobEEENS2_INS_10shared_ptrINS4_9JobStatusEEEEEEclINS_4_mfi3mf1IvS4_S9_EENS0_5list0EEEvNS0_4typeIvEERT_RT0_i+0x79) [0xe95a17]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS5_9JobStatusEEEEENS0_5list2INS0_5valueIPS5_EENSB_IS8_EEEEEclEv+0x3f) [0xe95995]
 m31006|  /home/kangas/workspace/mongo/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x1e) [0xe95922]
 m31006|  /home/kangas/workspace/mongo/mongos() [0xf01ce6]
 m31006|  /lib64/libpthread.so.0() [0x3deda07851]
 m31006|  /lib64/libc.so.6(clone+0x6d) [0x3ded6e890d]
 m31006| 2013-08-20T18:02:30.946-0500 [ReplicaSetMonitorWatcher] dev: lastError==0 won't report:0
 m31006| 2013-08-20T18:02:30.946-0500 [ReplicaSetMonitorWatcher]
 m31006|
 m31006| ***aborting after verify() failure as this is a debug/test build

Comment by Matt Kangas [ 20/Aug/13 ]

07c221086ff5c20d4d62ba69837d90404459c335 is the first bad commit
commit 07c221086ff5c20d4d62ba69837d90404459c335
Author: Randolph Tan <randolph@10gen.com>
Date:   Mon Aug 12 15:53:16 2013 -0400
 
    SERVER-8707 dbclient_rs_test threading issue
 
    Thoroughly cleanup all replica set connection related artifacts after every test.
 
:040000 040000 05985579640e8e0f265c789d91bab55db94d1777 1fba84636f312adc5c2b3d9d4343a138338eec0f M src
bisect run success

Comment by Matt Kangas [ 20/Aug/13 ]

Ignore my previous git-bisect result. It was a case of GIGO (garbage in, garbage out): all it identified was the compile failure in e54b41cc. The later commit f1f9514 is good, which leaves one of Randolph's commits.

07c2210 2013-08-12 15:53 Randolph Tan       o [master] [origin/HEAD] [origin/master] SERVER-8707 dbclient_rs_test threading i~
712619f 2013-08-20 14:37 Randolph Tan       o SERVER-8707 dbclient_rs_test threading issue

Comment by Randolph Tan [ 20/Aug/13 ]

I have an idea of the cause and believe with high confidence that my recent commit broke this.
