[SERVER-2445] Assertion failure in replica set + durability while killing and restarting servers Created: 31/Jan/11  Updated: 03/Sep/11  Resolved: 03/Sep/11

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 1.7.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tony Hannan Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

db version v1.8.0-rc0-pre-
git version: 9a6f2dcf91c2025f7549c875e78e4a203b26adcf


Operating System: OS X
Participants:

 Description   

I have a test case that
1. Starts up a replica set with 2 servers and 1 arbiter, all with --dur option.
2. Loads 10000 1K docs into a collection
3. Forks 5 client threads, each $pushes then $pulls its own id to/from an array in all document (multi-update)
5. A 6th thread kills a random server in the replica set every 0-60 secs then restarts it 0-30 secs later.

After a few minutes I get the error below in one of the servers.

You can run my testcase by downloading http://github.com/TonyGen/mongoTest-cpp and running
> mongoTest 666 controller 127.0.0.1 mongoTest::One
The log of each server is written to the current directory and is named mongodX.out where X is 1, 2, or 3. I usually tail -F each in a separate terminal window while it is running.

I get an assertion error on every run on my MacBook Pro.

Log from first replica server where error starts:
Mon Jan 31 14:17:55 [conn6] couldn't make room for new record (len: 144) in capped ns local.oplog.rs
Extent 0 (capExtent)
magic: 41424344 extent->ns: local.oplog.rs
fr: null lr: 2:3771fd0 extent->len: 192000256
local.oplog.rs Assertion failure len * 5 > lastExtentSize db/namespace.cpp 423
0x100037d3e 0x10003d9c6 0x100097ed8 0x10009b95d 0x10009773c 0x1002ca34a 0x100287691 0x1000914a2 0x1001f4f0f 0x1001f79cb 0x1002baef0 0x1002bdd86 0x1003c5c1d 0x1007c6254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x0000000100037d3e _ZN5mongo12sayDbContextEPKc + 174
1 mongod 0x000000010003d9c6 _ZN5mongo8assertedEPKcS1_j + 342
2 mongod 0x0000000100097ed8 _ZNK5mongo16NamespaceDetails13maybeComplainEPKci + 1736
3 mongod 0x000000010009b95d _ZN5mongo16NamespaceDetails11cappedAllocEPKci + 2637
4 mongod 0x000000010009773c _ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE + 492
5 mongod 0x00000001002ca34a _ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci + 394
6 mongod 0x0000000100287691 _ZN5mongoL8_logOpRSEPKcS1_S1_RKNS_7BSONObjEPS2_Pb + 1761
7 mongod 0x00000001000914a2 _ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb + 66
8 mongod 0x00000001001f4f0f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE + 4367
9 mongod 0x00000001001f79cb _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE + 315
10 mongod 0x00000001002baef0 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE + 1120
11 mongod 0x00000001002bdd86 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 5734
12 mongod 0x00000001003c5c1d _ZN5mongo10connThreadEPNS_13MessagingPortE + 653
13 libboost_thread-mt.dylib 0x00000001007c6254 thread_proxy + 132
14 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
15 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Jan 31 14:17:55 [conn6] update test.col query: {} exception 0 assertion db/namespace.cpp:423 31ms
Mon Jan 31 14:17:55 [conn7] couldn't make room for new record (len: 124) in capped ns local.oplog.rs
Extent 0 (capExtent)
magic: 41424344 extent->ns: local.oplog.rs
fr: null lr: 2:3771fd0 extent->len: 192000256
local.oplog.rs Assertion failure len * 5 > lastExtentSize db/namespace.cpp 423
0x100037d3e 0x10003d9c6 0x100097ed8 0x10009b95d 0x10009773c 0x1002ca34a 0x100287691 0x1000914a2 0x1001f4f0f 0x1001f79cb 0x1002baef0 0x1002bdd86 0x1003c5c1d 0x1007c6254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x0000000100037d3e _ZN5mongo12sayDbContextEPKc + 174
1 mongod 0x000000010003d9c6 _ZN5mongo8assertedEPKcS1_j + 342
2 mongod 0x0000000100097ed8 _ZNK5mongo16NamespaceDetails13maybeComplainEPKci + 1736
3 mongod 0x000000010009b95d _ZN5mongo16NamespaceDetails11cappedAllocEPKci + 2637
4 mongod 0x000000010009773c _ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE + 492
5 mongod 0x00000001002ca34a _ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci + 394
6 mongod 0x0000000100287691 _ZN5mongoL8_logOpRSEPKcS1_S1_RKNS_7BSONObjEPS2_Pb + 1761
7 mongod 0x00000001000914a2 _ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb + 66
8 mongod 0x00000001001f4f0f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE + 4367
9 mongod 0x00000001001f79cb _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE + 315
10 mongod 0x00000001002baef0 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE + 1120
11 mongod 0x00000001002bdd86 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 5734
12 mongod 0x00000001003c5c1d _ZN5mongo10connThreadEPNS_13MessagingPortE + 653
13 libboost_thread-mt.dylib 0x00000001007c6254 thread_proxy + 132
14 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
15 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Jan 31 14:17:55 [conn7] update test.col query: {} exception 0 assertion db/namespace.cpp:423 171ms
Mon Jan 31 14:17:55 [conn10] couldn't make room for new record (len: 124) in capped ns local.oplog.rs
Extent 0 (capExtent)
magic: 41424344 extent->ns: local.oplog.rs
fr: null lr: 2:3771fd0 extent->len: 192000256
local.oplog.rs Assertion failure len * 5 > lastExtentSize db/namespace.cpp 423
0x100037d3e 0x10003d9c6 0x100097ed8 0x10009b95d 0x10009773c 0x1002ca34a 0x100287691 0x1000914a2 0x1001f4f0f 0x1001f79cb 0x1002baef0 0x1002bdd86 0x1003c5c1d 0x1007c6254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x0000000100037d3e _ZN5mongo12sayDbContextEPKc + 174
1 mongod 0x000000010003d9c6 _ZN5mongo8assertedEPKcS1_j + 342
2 mongod 0x0000000100097ed8 _ZNK5mongo16NamespaceDetails13maybeComplainEPKci + 1736
3 mongod 0x000000010009b95d _ZN5mongo16NamespaceDetails11cappedAllocEPKci + 2637
4 mongod 0x000000010009773c _ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE + 492
5 mongod 0x00000001002ca34a _ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci + 394
6 mongod 0x0000000100287691 _ZN5mongoL8_logOpRSEPKcS1_S1_RKNS_7BSONObjEPS2_Pb + 1761
7 mongod 0x00000001000914a2 _ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb + 66
8 mongod 0x00000001001f4f0f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE + 4367
9 mongod 0x00000001001f79cb _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE + 315
10 mongod 0x00000001002baef0 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE + 1120
11 mongod 0x00000001002bdd86 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 5734
12 mongod 0x00000001003c5c1d _ZN5mongo10connThreadEPNS_13MessagingPortE + 653
13 libboost_thread-mt.dylib 0x00000001007c6254 thread_proxy + 132
14 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
15 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Jan 31 14:17:55 [conn10] update test.col query: {} exception 0 assertion db/namespace.cpp:423 660ms
Mon Jan 31 14:17:55 [conn9] couldn't make room for new record (len: 124) in capped ns local.oplog.rs
Extent 0 (capExtent)
magic: 41424344 extent->ns: local.oplog.rs
fr: null lr: 2:3771fd0 extent->len: 192000256
local.oplog.rs Assertion failure len * 5 > lastExtentSize db/namespace.cpp 423
0x100037d3e 0x10003d9c6 0x100097ed8 0x10009b95d 0x10009773c 0x1002ca34a 0x100287691 0x1000914a2 0x1001f4f0f 0x1001f79cb 0x1002baef0 0x1002bdd86 0x1003c5c1d 0x1007c6254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x0000000100037d3e _ZN5mongo12sayDbContextEPKc + 174
1 mongod 0x000000010003d9c6 _ZN5mongo8assertedEPKcS1_j + 342
2 mongod 0x0000000100097ed8 _ZNK5mongo16NamespaceDetails13maybeComplainEPKci + 1736
3 mongod 0x000000010009b95d _ZN5mongo16NamespaceDetails11cappedAllocEPKci + 2637
4 mongod 0x000000010009773c _ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE + 492
5 mongod 0x00000001002ca34a _ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci + 394
6 mongod 0x0000000100287691 _ZN5mongoL8_logOpRSEPKcS1_S1_RKNS_7BSONObjEPS2_Pb + 1761
7 mongod 0x00000001000914a2 _ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb + 66
8 mongod 0x00000001001f4f0f _ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE + 4367
9 mongod 0x00000001001f79cb _ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE + 315
10 mongod 0x00000001002baef0 _ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE + 1120
11 mongod 0x00000001002bdd86 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 5734
12 mongod 0x00000001003c5c1d _ZN5mongo10connThreadEPNS_13MessagingPortE + 653
13 libboost_thread-mt.dylib 0x00000001007c6254 thread_proxy + 132
14 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
15 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Jan 31 14:17:55 [conn9] update test.col query: {} exception 0 assertion db/namespace.cpp:423 2312ms
Mon Jan 31 14:17:55 [initandlisten] connection accepted from 127.0.0.1:63236 #19
Mon Jan 31 14:17:55 [ReplSetHealthPollTask] replSet member 127.0.0.1:27019 RECOVERING
Mon Jan 31 14:17:57 [conn19] query local.oplog.rs reslen:12863 nscanned:103 { ts:

{ $gte: new Date(5568430794390648212) }

} nreturned:101 1656ms
Mon Jan 31 14:18:01 [conn19] Assertion: 10334:Invalid BSONObj size: 50599229 (0x3D150403) first element: : ?type=122
0x10003e25d 0x100008d1d 0x1002ce8b8 0x1001d6380 0x1002b682d 0x1002bda27 0x1003c5c1d 0x1007c6254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x000000010003e25d _ZN5mongo11msgassertedEiPKc + 381
1 mongod 0x0000000100008d1d _ZNK5mongo7BSONObj14_assertInvalidEv + 1309
2 mongod 0x00000001002ce8b8 _ZN5mongo11BasicCursor7currentEv + 104
3 mongod 0x00000001001d6380 _ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb + 1968
4 mongod 0x00000001002b682d _ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE + 589
5 mongod 0x00000001002bda27 _ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE + 4871
6 mongod 0x00000001003c5c1d _ZN5mongo10connThreadEPNS_13MessagingPortE + 653
7 libboost_thread-mt.dylib 0x00000001007c6254 thread_proxy + 132
8 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
9 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Jan 31 14:18:01 [conn19] getmore local.oplog.rs cid:4746124974211895091 getMore: { query: {}, orderby:

{ $natural: -1 }

} exception 10334 Invalid BSONObj size: 50599229 (0x3D150403) first element: : ?type=122 bytes:20 nreturned:0 193ms
Mon Jan 31 14:18:11 [conn19] end connection 127.0.0.1:63236
Mon Jan 31 14:18:12 [initandlisten] connection accepted from 127.0.0.1:63246 #20
Mon Jan 31 14:18:13 [conn20] query local.oplog.rs reslen:12863 nscanned:103 { ts:

{ $gte: new Date(5568430794390648212) }

} nreturned:101 1525ms
Mon Jan 31 14:18:23 [conn20] end connection 127.0.0.1:63246
Mon Jan 31 14:18:24 [initandlisten] connection accepted from 127.0.0.1:63253 #21
Mon Jan 31 14:18:26 [conn21] query local.oplog.rs reslen:12863 nscanned:103 { ts:

{ $gte: new Date(5568430794390648212) }

} nreturned:101 1525ms
Mon Jan 31 14:18:36 [conn21] end connection 127.0.0.1:63253
Mon Jan 31 14:18:37 [initandlisten] connection accepted from 127.0.0.1:63260 #22
Mon Jan 31 14:18:38 [conn22] query local.oplog.rs reslen:12863 nscanned:103 { ts:

{ $gte: new Date(5568430794390648212) }

} nreturned:101 1529ms
Mon Jan 31 14:18:48 [conn22] end connection 127.0.0.1:63260
Mon Jan 31 14:18:49 [initandlisten] connection accepted from 127.0.0.1:63267 #23
Mon Jan 31 14:18:50 [dur] lsn set 234203
Mon Jan 31 14:18:51 [conn23] query local.oplog.rs reslen:12863 nscanned:103 { ts:

{ $gte: new Date(5568430794390648212) }

} nreturned:101 1670ms



 Comments   
Comment by Eliot Horowitz (Inactive) [ 03/Sep/11 ]

THis was probably one of the numerous journalling issues fixed

Comment by Tony Hannan [ 07/Feb/11 ]

My C++ test (from original description above) induces the following server errors occasionally. I'm unable to induce these from my Javascript test.

Error 1 - can't restart after kill:
Mon Feb 7 13:41:37 [initandlisten] recover cleaning up
Mon Feb 7 13:41:37 [initandlisten] removeJournalFiles
Mon Feb 7 13:41:37 [initandlisten] recover done
Mon Feb 7 13:41:38 [initandlisten] preallocateIsFaster=true 3.02
Mon Feb 7 13:41:38 [initandlisten] preallocating a journal file /db/db1/journal/prealloc.0
Mon Feb 7 13:42:01 [initandlisten] preallocating a journal file /db/db1/journal/prealloc.1
Mon Feb 7 13:42:41 [initandlisten] MongoDB starting : pid=2296 port=27018 dbpath=/db/db1 64-bit
Mon Feb 7 13:42:41 [initandlisten] db version v1.8.0-rc0-pre-, pdfile version 4.5
Mon Feb 7 13:42:41 [initandlisten] git version: b0834a3e4314353e6b4a9890343119d6810e8272
Mon Feb 7 13:42:41 [initandlisten] sys info: Darwin dmmb.local 10.5.0 Darwin Kernel Version 10.5.0: Fri Nov 5 23:20:39 PDT 2010; root:xnu-1504.9.17~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_44

                            • old lock file: /db/db1/mongod.lock. probably means unclean shutdown
                              but there are no journal files to recover.
                              see: http://dochub.mongodb.org/core/repair for more information
                              *************
                              Mon Feb 7 13:42:41 [initandlisten] exception in initAndListen std::exception: old lock file, terminating
                              ...

Error 2 - assertion failure:
Mon Feb 7 16:32:31 [ReplSetHealthPollTask] replSet info 127.0.0.1:27019 is up
Mon Feb 7 16:32:31 [ReplSetHealthPollTask] replSet member 127.0.0.1:27019 PRIMARY
Mon Feb 7 16:32:31 [ReplSetHealthPollTask] replSet info 127.0.0.1:27020 is up
Mon Feb 7 16:32:31 [ReplSetHealthPollTask] replSet member 127.0.0.1:27020 ARBITER
Mon Feb 7 16:32:31 [replica set sync] building new index on

{ _id: 1 }

for local.me
Mon Feb 7 16:32:31 [replica set sync] done for 0 records 0secs
Mon Feb 7 16:32:32 [replica set sync] replSet our last op time written: Feb 7 16:32:09:8d0
Mon Feb 7 16:32:32 [replica set sync] replset source's GTE: Feb 7 16:32:29:1
Mon Feb 7 16:32:32 [replica set sync] replSet rollback 0
Mon Feb 7 16:32:32 [replica set sync] replSet ROLLBACK
Mon Feb 7 16:32:32 [replica set sync] replSet rollback 1
Mon Feb 7 16:32:32 [replica set sync] replSet rollback 2 FindCommonPoint
Mon Feb 7 16:32:32 [replica set sync] replSet info rollback our last optime: Feb 7 16:32:09:8d0
Mon Feb 7 16:32:32 [replica set sync] replSet info rollback their last optime: Feb 7 16:32:32:ad4
Mon Feb 7 16:32:32 [replica set sync] replSet info rollback diff in end of log times: -23 seconds
Mon Feb 7 16:32:33 [replica set sync] replSet rollback found matching events at Feb 7 16:30:19:689
Mon Feb 7 16:32:33 [replica set sync] replSet rollback findcommonpoint scanned : 691768
Mon Feb 7 16:32:33 [replica set sync] replSet replSet rollback 3 fixup
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 3.5
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 4 n:10000
Mon Feb 7 16:32:40 [replica set sync] replSet minvalid=Feb 7 16:32:39 4d5064f7:1190
Mon Feb 7 16:32:40 [replica set sync] building new index on

{ _id: 1 }

for local.replset.minvalid
Mon Feb 7 16:32:40 [replica set sync] done for 0 records 0secs
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 4.6
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 4.7
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 5 d:0 u:10000
Mon Feb 7 16:32:40 [replica set sync] replSet rollback 6
Mon Feb 7 16:32:41 [replica set sync] replSet syncThread: 13623 DR102 too much data written uncommitted
Mon Feb 7 16:32:41 [dur] lsn set 11485
Mon Feb 7 16:32:53 [replica set sync] replSet our last op time written: Feb 7 16:32:09:8d0
Mon Feb 7 16:32:53 [replica set sync] replset source's GTE: Feb 7 16:32:29:1
Mon Feb 7 16:32:53 [replica set sync] replSet rollback 0
Mon Feb 7 16:32:53 [replica set sync] replSet rollback 1
Mon Feb 7 16:32:53 [replica set sync] replSet rollback 2 FindCommonPoint
Mon Feb 7 16:32:53 [replica set sync] replSet rollback 2 error findcommonpoint waiting a while before trying again
Mon Feb 7 16:33:06 [replica set sync] DBClientCursor::init call() failed
Mon Feb 7 16:33:06 [conn5] end connection 127.0.0.1:58367
Mon Feb 7 16:33:06 [replica set sync] Assertion failure r.haveCursor() db/repl/rs_sync.cpp 285
0x10003853e 0x10003e176 0x10028667e 0x10028772e 0x1002878db 0x100287efb 0x100c44254 0x7fff887cb536 0x7fff887cb3e9
0 mongod 0x000000010003853e _ZN5mongo12sayDbContextEPKc + 174
1 mongod 0x000000010003e176 _ZN5mongo8assertedEPKcS1_j + 342
2 mongod 0x000000010028667e _ZN5mongo11ReplSetImpl8syncTailEv + 2558
3 mongod 0x000000010028772e _ZN5mongo11ReplSetImpl11_syncThreadEv + 318
4 mongod 0x00000001002878db _ZN5mongo11ReplSetImpl10syncThreadEv + 139
5 mongod 0x0000000100287efb _ZN5mongo15startSyncThreadEv + 731
6 libboost_thread-mt.dylib 0x0000000100c44254 thread_proxy + 132
7 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
8 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
Mon Feb 7 16:33:06 [replica set sync] replSet syncThread: 0 assertion db/repl/rs_sync.cpp:285

Comment by Tony Hannan [ 07/Feb/11 ]

Ok, jstests/multiClient/rsDurKillRestart1.js is now thread-safe. It gives following assertion error some of the time:

m31000| Sun Feb 6 22:42:43 [replica set sync] Assertion failure r.haveCursor() db/repl/rs_sync.cpp 285
m31000| 0x100038a1e 0x10003e656 0x10027a9fe 0x10027baae 0x10027bc5b 0x10027c27b 0x1007c8254 0x7fff887cb536 0x7fff887cb3e9
m31000| 0 mongod 0x0000000100038a1e _ZN5mongo12sayDbContextEPKc + 174
m31000| 1 mongod 0x000000010003e656 _ZN5mongo8assertedEPKcS1_j + 342
m31000| 2 mongod 0x000000010027a9fe _ZN5mongo11ReplSetImpl8syncTailEv + 2558
m31000| 3 mongod 0x000000010027baae _ZN5mongo11ReplSetImpl11_syncThreadEv + 318
m31000| 4 mongod 0x000000010027bc5b _ZN5mongo11ReplSetImpl10syncThreadEv + 139
m31000| 5 mongod 0x000000010027c27b _ZN5mongo15startSyncThreadEv + 731
m31000| 6 libboost_thread-mt.dylib 0x00000001007c8254 thread_proxy + 132
m31000| 7 libSystem.B.dylib 0x00007fff887cb536 _pthread_start + 331
m31000| 8 libSystem.B.dylib 0x00007fff887cb3e9 thread_start + 13
m31000| Sun Feb 6 22:42:43 [replica set sync] replSet syncThread: 0 assertion db/repl/rs_sync.cpp:285

Comment by Tony Hannan [ 04/Feb/11 ]

Ah, sorry, of course. I was used to connection object being thread-safe like in the other drivers. I will edit.

Comment by Eliot Horowitz (Inactive) [ 04/Feb/11 ]

I don't think your test is thread safe.

Your'e re-using the same connection to a server from multiple threads.

Comment by Tony Hannan [ 03/Feb/11 ]

Made same test case in javascript: jstests/multiClient/rsDurKillRestart1.js

1. Build mongo shell with V8 so multiple client threads can run
2. $ mongo --nodb
3. > load('jstests/multiClient/rsDurKillRestart1.js')
4. > go(100)

Get following 2 client errors in log. Cannot reproduce server error described above in this javascript version, only in above C++ version. Second error below does not happen as regularly as first.

Error 1:

update 1 round 59
Thu Feb 3 16:25:43 ERROR: MessagingPort::call() wrong id got:304d5c30 expect:304d5c2e
toSend op: 2004
response msgid:930392286
response len: 64
response op: 1
farEnd: 127.0.0.1:31001
Thu Feb 3 16:25:43 Assertion failure false util/message.cpp 512
0x1000111be 0x100013e16 0x10007bba1 0x10007c685 0x10008d483 0x1000ab16d 0x10008e46d 0x1000a0c5b 0x1000d777a 0x1000dc9d6 0x10585abbf
0 mongo 0x00000001000111be _ZN5mongo12sayDbContextEPKc + 174
1 mongo 0x0000000100013e16 _ZN5mongo8assertedEPKcS1_j + 342
2 mongo 0x000000010007bba1 ZN5mongo13MessagingPort4recvERKNS_7MessageERS1 + 705
3 mongo 0x000000010007c685 ZN5mongo13MessagingPort4callERNS_7MessageES2 + 53
4 mongo 0x000000010008d483 _ZN5mongo18DBClientConnection4callERNS_7MessageES2_b + 51
5 mongo 0x00000001000ab16d _ZN5mongo14DBClientCursor4initEv + 301
6 mongo 0x000000010008e46d _ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii + 1085
7 mongo 0x00000001000a0c5b _ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii + 139
8 mongo 0x00000001000d777a _ZN5mongo9mongoFindERKN2v89ArgumentsE + 858
9 mongo 0x00000001000dc9d6 ZN5mongo10v8CallbackIXadL_ZNS_9mongoFindERKN2v89ArgumentsEEEEENS1_6HandleINS1_5ValueEEES4 + 118
10 ??? 0x000000010585abbf 0x0 + 4387613631
query failed (down primary), retry soon: socket error on query

Error 2:

Thu Feb 3 16:25:53 trying reconnect to 127.0.0.1:31001
Thu Feb 3 16:25:53 MessagingPort recv() errno:9 Bad file descriptor 127.0.0.1:31001
Thu Feb 3 16:25:53 mongo got signal 11 (Segmentation fault), stack trace:

Thu Feb 3 16:25:53 reconnect 127.0.0.1:31001 ok
Thu Feb 3 16:25:53 0x100020699 0x7fff887f267a 0x3abb0 0x10007b937 0x10007c685 0x10008d483 0x1000ab16d 0x10008e46d 0x1000a0c5b 0x1000d777a 0x1000dc9d6 0x10585abbf
0 mongo 0x0000000100020699 _Z12quitAbruptlyi + 1017
1 libSystem.B.dylib 0x00007fff887f267a _sigtramp + 26
2 ??? 0x000000000003abb0 0x0 + 240560
3 mongo 0x000000010007b937 ZN5mongo13MessagingPort4recvERKNS_7MessageERS1 + 87
4 mongo 0x000000010007c685 ZN5mongo13MessagingPort4callERNS_7MessageES2 + 53
5 mongo 0x000000010008d483 _ZN5mongo18DBClientConnection4callERNS_7MessageES2_b + 51
6 mongo 0x00000001000ab16d _ZN5mongo14DBClientCursor4initEv + 301
7 mongo 0x000000010008e46d _ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii + 1085
8 mongo 0x00000001000a0c5b _ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii + 139
9 mongo 0x00000001000d777a _ZN5mongo9mongoFindERKN2v89ArgumentsE + 858
10 mongo 0x00000001000dc9d6 ZN5mongo10v8CallbackIXadL_ZNS_9mongoFindERKN2v89ArgumentsEEEEENS1_6HandleINS1_5ValueEEES4 + 118
11 ??? 0x000000010585abbf 0x0 + 4387613631

Comment by Eliot Horowitz (Inactive) [ 31/Jan/11 ]

Can you do this with our normal test harness.

Generated at Thu Feb 08 02:59:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.