[SERVER-7163] Replica set crash with segfault Created: 26/Sep/12  Updated: 15/Feb/13  Resolved: 22/Oct/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Roman Janusz Assignee: Eric Milkie
Resolution: Duplicate Votes: 2
Labels: crash, replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 3.0.0-12-server #20-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux


Attachments: File mongocrash-acs1     File mongocrash-acs2     File mongocrash-acs3     File mongocrash-acs4     File mongocrash2-acs3     File mongocrash2-acs4    
Issue Links:
Duplicate
duplicates JAVA-660 Concurrency issue can cause corrupted... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: Linux
Participants:

 Description   

We had a 4-node replica set crashed, all nodes at the same time.

Here are some log fragments showing the segfault:

Node 1:

Wed Sep 26 10:59:06 [conn29622] query datamodel.datamodel query: { _id: "C83A35-c83a35220710" } ntoreturn:1 idhack:1 keyUpdates:0 locks(micros) r:220254 reslen:256219 220ms
Wed Sep 26 10:59:08 [rsHealthPoll] replSet member ACS-PROV2:27017 is up
Wed Sep 26 10:59:08 [rsHealthPoll] replSet member ACS-PROV2:27017 is now in state SECONDARY
Wed Sep 26 10:59:08 [rsSync] info DFM::findAll(): extent 2d:4070b000 was empty, skipping ahead. ns:local.replset.minvalid
Wed Sep 26 10:59:09 [rsHealthPoll] replSet member ACS-PROV4:27017 is up
Wed Sep 26 10:59:09 [rsHealthPoll] replSet member ACS-PROV4:27017 is now in state SECONDARY
Wed Sep 26 10:59:09 [rsMgr] replSet warning caught unexpected exception in electSelf()
Wed Sep 26 10:59:09 Invalid access at address: 0x7f6a0b42e360 from thread: 
 
Wed Sep 26 10:59:09 Got signal: 11 (Segmentation fault).
 
Wed Sep 26 10:59:09 Backtrace:
0xade6e1 0x5582d9 0x558862 0x7fc5508c7060 0x7f6a0b42e360 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x558862]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fc5508c7060]
 [0x7f6a0b42e360]

Node 2:

Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46414 #35439 (42 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46413 #35440 (43 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57056 #35441 (44 connections now open)
Wed Sep 26 10:59:13 [conn35441] recv(): message len 0 is too large0
Wed Sep 26 10:59:13 [conn35441] end connection 10.220.40.26:57056 (43 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57057 #35442 (44 connections now open)
Wed Sep 26 10:59:13 [conn35414] recv(): message len 0 is too large0
Wed Sep 26 10:59:13 [conn35414] end connection 10.220.40.26:57054 (43 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57058 #35443 (44 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57062 #35444 (45 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57065 #35445 (46 connections now open)
Wed Sep 26 10:59:13 [conn35444] Assertion: 10334:Invalid BSONObj size: 0 (0x00000000) first element: <98>^U^GG: ?type=112
0xade6e1 0x8036eb 0x80386c 0x571bc7 0x571f4c 0x7a2f42 0xb5b9b9 0xb5d052 0x56fa52 0x5dbf11 0x7fef22966efc 0x7fef21cf889d 
Wed Sep 26 10:59:13 [conn35445] recv(): message len 0 is too large0
Wed Sep 26 10:59:13 [conn35445] end connection 10.220.40.26:57065 (45 connections now open)
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0x8036eb]
 /usr/bin/mongod() [0x80386c]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x497) [0x571bc7]
 /usr/bin/mongod(_ZN5mongo7BSONObjC1EPKc+0x3c) [0x571f4c]
 /usr/bin/mongod(_ZN5mongo12QueryMessageC1ERNS_9DbMessageE+0xb2) [0x7a2f42]
 /usr/bin/mongod() [0xb5b9b9]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x3a2) [0xb5d052]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fa52]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x5dbf11]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fef22966efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fef21cf889d]
Wed Sep 26 10:59:13 [conn35444] AssertionException handling request, closing client connection: 10334 Invalid BSONObj size: 0 (0x00000000) first element: <98>^U^GG: ?type=112
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.25:47291 #35446 (45 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.25:47292 #35447 (46 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46415 #35448 (47 connections now open)
Wed Sep 26 10:59:13 [conn35448] Assertion: 10334:Invalid BSONObj size: 809775172 (0x44304430) first element: 0022938C388C: ?type=45
0xade6e1 0x8036eb 0x80386c 0x571bc7 0x571f4c 0x7a2f42 0xb5b9b9 0xb5d052 0x56fa52 0x5dbf11 0x7fef22966efc 0x7fef21cf889d 
Wed Sep 26 10:59:13 [conn35437] recv(): message len 0 is too large0
Wed Sep 26 10:59:13 [conn35437] end connection 10.220.40.23:46410 (46 connections now open)
Wed Sep 26 10:59:13 [conn35419] end connection 10.220.40.23:46387 (45 connections now open)
Wed Sep 26 10:59:13 [conn35418] end connection 10.220.40.23:46386 (44 connections now open)
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0x8036eb]
 /usr/bin/mongod() [0x80386c]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x497) [0x571bc7]
 /usr/bin/mongod(_ZN5mongo7BSONObjC1EPKc+0x3c) [0x571f4c]
 /usr/bin/mongod(_ZN5mongo12QueryMessageC1ERNS_9DbMessageE+0xb2) [0x7a2f42]
 /usr/bin/mongod() [0xb5b9b9]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x3a2) [0xb5d052]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fa52]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x5dbf11]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fef22966efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fef21cf889d]
Wed Sep 26 10:59:13 [conn35448] AssertionException handling request, closing client connection: 10334 Invalid BSONObj size: 809775172 (0x44304430) first element: 0022938C388C: ?type=45
Wed Sep 26 10:59:13 [conn35423] recv(): message len 0 is too large0
Wed Sep 26 10:59:13 [conn35423] end connection 10.220.40.23:46394 (42 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46426 #35449 (43 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46433 #35450 (44 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46432 #35451 (45 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46440 #35452 (46 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46441 #35453 (47 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46445 #35454 (48 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46446 #35455 (49 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46447 #35456 (50 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46448 #35457 (51 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.23:46449 #35458 (52 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.25:47294 #35459 (53 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57066 #35460 (54 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57068 #35461 (55 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57070 #35462 (56 connections now open)
Wed Sep 26 10:59:13 [initandlisten] connection accepted from 10.220.40.26:57071 #35463 (57 connections now open)
Wed Sep 26 10:59:13 Invalid access at address: 0x7f93fb9d2000 from thread: conn35459
 
Wed Sep 26 10:59:13 Got signal: 11 (Segmentation fault).
 
Wed Sep 26 10:59:13 Backtrace:
0xade6e1 0x5582d9 0x558862 0x7fef2296f060 0x7fef21d5b056 0x610243 0x93070b 0x93d90b 0x944669 0x944f72 0x9456c7 0x880697 0x8810c6 0x849cfa 0x84b258 0x84b41b 0x6b046c 0x6b3d38 0xb5ba7d 0xb5d052 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x558862]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fef2296f060]
 /lib/x86_64-linux-gnu/libc.so.6(+0x144056) [0x7fef21d5b056]
 /usr/bin/mongod(_ZN5mongo20compareElementValuesERKNS_11BSONElementES2_+0x313) [0x610243]
 /usr/bin/mongod(_ZN5mongo10FieldRange9intersectERKS0_b+0x12b) [0x93070b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet19intersectMatchFieldEPKcRKNS_11BSONElementEbb+0x5b) [0x93d90b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet16handleMatchFieldERKNS_11BSONElementEb+0x1b9) [0x944669]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet4initEb+0xa2) [0x944f72]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x147) [0x9456c7]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4initERKNS_7BSONObjES3_S3_+0x87) [0x880697]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4makeEPKcRKNS_7BSONObjES5_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES5_NS_18QueryPlanGenerator18RecordedPlanPolicyES5_S5_+0x76) [0x8810c6]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0xea) [0x849cfa]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator8generateEv+0x98) [0x84b258]
 /usr/bin/mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorEPKcRKNS_7BSONObjES5_RKNS_24QueryPlanSelectionPolicyEPbRKN5boost10shared_ptrIKNS_11ParsedQueryEEEbPNS_16QueryPlanSummaryE+0x4b) [0x84b41b]
 /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_17ShardChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0xfc) [0x6b046c]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x16b8) [0x6b3d38]
 /usr/bin/mongod() [0xb5ba7d]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x3a2) [0xb5d052]

Node 3:

Wed Sep 26 10:59:24 [initandlisten] connection accepted from 10.220.40.23:40409 #40515 (269 connections now open)
Wed Sep 26 10:59:24 [initandlisten] connection accepted from 10.220.40.26:52395 #40516 (270 connections now open)
Wed Sep 26 10:59:24 [conn39893] end connection 10.220.40.23:39959 (269 connections now open)
Wed Sep 26 10:59:24 [conn40022] end connection 10.220.40.23:40038 (268 connections now open)
Wed Sep 26 10:59:24 [conn40424] end connection 10.220.40.26:52385 (267 connections now open)
Wed Sep 26 10:59:24 [initandlisten] connection accepted from 10.220.40.23:40410 #40517 (268 connections now open)
Wed Sep 26 10:59:24 Invalid access at address: 0x7edd6b5ac000 from thread: conn39940
 
Wed Sep 26 10:59:24 Got signal: 11 (Segmentation fault).
 
Wed Sep 26 10:59:24 Backtrace:
0xade6e1 0x5582d9 0x558862 0x7f37fcddf060 0x7f37fc1cb048 0x610243 0x93070b 0x93d90b 0x944669 0x944f72 0x9456c7 0x880697 0x8810c6 0x849cfa 0x84b258 0x84b41b 0xc07af3 0xc0cddd 0xb546af 0xb5dd61 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x558862]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7f37fcddf060]
 /lib/x86_64-linux-gnu/libc.so.6(+0x144048) [0x7f37fc1cb048]
 /usr/bin/mongod(_ZN5mongo20compareElementValuesERKNS_11BSONElementES2_+0x313) [0x610243]
 /usr/bin/mongod(_ZN5mongo10FieldRange9intersectERKS0_b+0x12b) [0x93070b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet19intersectMatchFieldEPKcRKNS_11BSONElementEbb+0x5b) [0x93d90b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet16handleMatchFieldERKNS_11BSONElementEb+0x1b9) [0x944669]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet4initEb+0xa2) [0x944f72]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x147) [0x9456c7]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4initERKNS_7BSONObjES3_S3_+0x87) [0x880697]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4makeEPKcRKNS_7BSONObjES5_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES5_NS_18QueryPlanGenerator18RecordedPlanPolicyES5_S5_+0x76) [0x8810c6]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0xea) [0x849cfa]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator8generateEv+0x98) [0x84b258]
 /usr/bin/mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorEPKcRKNS_7BSONObjES5_RKNS_24QueryPlanSelectionPolicyEPbRKN5boost10shared_ptrIKNS_11ParsedQueryEEEbPNS_16QueryPlanSummaryE+0x4b) [0x84b41b]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyE+0x4b3) [0xc07af3]
 /usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0x12d) [0xc0cddd]
 /usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4cf) [0xb546af]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x10b1) [0xb5dd61]

Node 4:

Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48214 #34542 (29 connections now open)
Wed Sep 26 10:59:17 [conn34542] recv(): message len 0 is too large0
Wed Sep 26 10:59:17 [conn34542] end connection 10.220.40.26:48214 (27 connections now open)
Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48215 #34543 (28 connections now open)
Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48216 #34544 (29 connections now open)
Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48217 #34545 (30 connections now open)
Wed Sep 26 10:59:17 [conn34544] recv(): message len 4 is too large4
Wed Sep 26 10:59:17 [conn34544] end connection 10.220.40.26:48216 (29 connections now open)
Wed Sep 26 10:59:17 [conn34543] Assertion: 10306:Client Error: Next object larger than space left in message
Wed Sep 26 10:59:17 [conn34545] recv(): message len 4 is too large4
0xade6e1 0x8036eb 0x7a3364 0xb5b9b9 0xb5d052 0x56fa52 0x5dbf11 0x7f74470f7efc 0x7f744648989d 
Wed Sep 26 10:59:17 [conn34545] end connection 10.220.40.26:48217 (28 connections now open)
Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48218 #34546 (29 connections now open)
Wed Sep 26 10:59:17 [initandlisten] connection accepted from 10.220.40.26:48219 #34547 (30 connections now open)
Wed Sep 26 10:59:17 [conn34546] recv(): message len 4 is too large4
Wed Sep 26 10:59:17 [conn34546] end connection 10.220.40.26:48218 (29 connections now open)
Wed Sep 26 10:59:17 [conn34547] recv(): message len 0 is too large0
Wed Sep 26 10:59:17 [conn34547] end connection 10.220.40.26:48219 (29 connections now open)
Wed Sep 26 10:59:17 Backtrace:
0xade6e1 0x5582d9 0x558862 0x7f7447100060 0x7f74464ec02e 0x610243 0x93070b 0x93d90b 0x944669 0x944f72 0x9456c7 0x880697 0x8810c6 0x849cfa 0x84b258 0x84b41b 0x6b046c 0x6b3d38 0xb5ba7d 0xb5d052 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x558862]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7f7447100060]
 /lib/x86_64-linux-gnu/libc.so.6(+0x14402e) [0x7f74464ec02e]
 /usr/bin/mongod(_ZN5mongo20compareElementValuesERKNS_11BSONElementES2_+0x313) [0x610243]
 /usr/bin/mongod(_ZN5mongo10FieldRange9intersectERKS0_b+0x12b) [0x93070b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet19intersectMatchFieldEPKcRKNS_11BSONElementEbb+0x5b) [0x93d90b]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet16handleMatchFieldERKNS_11BSONElementEb+0x1b9) [0x944669]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSet4initEb+0xa2) [0x944f72]
 /usr/bin/mongod(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x147) [0x9456c7]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4initERKNS_7BSONObjES3_S3_+0x87) [0x880697]
 /usr/bin/mongod(_ZN5mongo16MultiPlanScanner4makeEPKcRKNS_7BSONObjES5_RKN5boost10shared_ptrIKNS_11ParsedQueryEEES5_NS_18QueryPlanGenerator18RecordedPlanPolicyES5_S5_+0x76) [0x8810c6]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator19setMultiPlanScannerEv+0xea) [0x849cfa]
 /usr/bin/mongod(_ZN5mongo15CursorGenerator8generateEv+0x98) [0x84b258]
 /usr/bin/mongod(_ZN5mongo25NamespaceDetailsTransient9getCursorEPKcRKNS_7BSONObjES5_RKNS_24QueryPlanSelectionPolicyEPbRKN5boost10shared_ptrIKNS_11ParsedQueryEEEbPNS_16QueryPlanSummaryE+0x4b) [0x84b41b]
 /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_17ShardChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0xfc) [0x6b046c]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x16b8) [0x6b3d38]
 /usr/bin/mongod() [0xb5ba7d]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x3a2) [0xb5d052]



 Comments   
Comment by Eric Milkie [ 25/Oct/12 ]

SERVER-6287 will cover validation of the wire protocol.

Comment by Nico Schottelius [ 25/Oct/12 ]

I was wondering - shouldn't be there a bugfix in the server as well? I mean, a malformed request can always come in from the network and the server should probably close the connection to that particular client and ignore it instead of segfaulting.

Comment by Daniel Pasette (Inactive) [ 22/Oct/12 ]

This crash was caused by JAVA-660.

Comment by Eric Milkie [ 02/Oct/12 ]

It absolutely should but it doesn't yet. The message parsing off the wire does minimal protocol conformance checks. This is something we're looking to add in a future release.

Comment by Roman Janusz [ 02/Oct/12 ]

I created separate issue for the Java driver, but shouldn't the server be able to protect itself against such situations?

Comment by Jeffrey Yemin [ 02/Oct/12 ]

Yes, please do.

Comment by Roman Janusz [ 02/Oct/12 ]

Mongo is now up for almost 90 hours without any issue, so it seems quite likely that the driver was the source of crashes. Should I create another ticket for that?

Comment by Roman Janusz [ 29/Sep/12 ]

Right now replica set in version 2.0.7 is up for almost 30 hours and we did not have any problems with it. We will also try to upgrade mongo again to 2.2.0 and see if something bad happens with that version, but I guess at the moment the main suspicion is that Java Mongo driver was broken in some version above 2.8.0.

By the way, another interesting symptom of our problems with mongo is that some corrupt data appeared in database - supposedly during the time that driver 2.9.1 was used. For example, badly-named databases and collections appeared in our database. Their names are often malformed versions of correct names - e.g. some part of the name is missing. If this helps with debugging, our collection names are often fully qualified Java class names, so they're long and contain a few dots.

Comment by Eric Milkie [ 28/Sep/12 ]

In the meantime I'd check to make sure everything is configured as you expect, and that the port numbers are correct for all servers and drivers.

Comment by Roman Janusz [ 28/Sep/12 ]

There is no way that anything other than our application connects to database. These IPs are OK, our application is running on these hosts. Could this be possible that MongoDB Java Driver 2.9.1 is corrupt? We have made an upgrade from 2.8.0 to 2.9.1 just about the time mongo started regularly crashing. We have downgraded the driver to 2.8.0 and we are monitoring the situation. We'll let you know if mongo crashes again.

Comment by Eric Milkie [ 28/Sep/12 ]

Hello Roman. Can you identify server 10.220.40.25 or 10.220.40.26? They have processes connecting to the mongod port but the processes aren't valid mongo drivers.

Comment by Roman Janusz [ 28/Sep/12 ]

Again, system crashed. This time however it was MongoDB version 2.0.7 and 2 out of 4 nodes crashed (acs3, acs4). Logs are attached.

Comment by Roman Janusz [ 27/Sep/12 ]

The crash has happened again. We suspected corrupted data to be the reason last time as mongo kept crashing, but this time we started with empty database and it run about 20h before finally crashing.

Here is more detailed info about the environment:

Mongo was running in replicaset on 4 nodes:
acs1 (10.220.40.23)
acs2 (10.220.40.24)
acs3 (10.220.40.25)
acs4 (10.220.40.26)

acs1 was most probably the primary node - it got all the writes and a little bit of reads (most reads were going to secondaries)

Here is current result from rs.config() - after the restart:

SECONDARY> rs.config()
{
	"_id" : "ump-avs",
	"version" : 4,
	"members" : [
		{
			"_id" : 0,
			"host" : "acs1:27017"
		},
		{
			"_id" : 1,
			"host" : "acs2:27017"
		},
		{
			"_id" : 2,
			"host" : "acs3:27017"
		},
		{
			"_id" : 3,
			"host" : "acs4:27017"
		}
	]
}

Application was running on three nodes - acs1, acs3, acs4. It is a Java application using Java MongoDB driver 2.9.1.
Database names are 'ump' and 'datamodel'.

Logs from all nodes are attached.

Comment by Eric Milkie [ 26/Sep/12 ]

Can you post more of the logs from each of the four nodes? Also post the configuration of your replica set by issuing rs.config() in the Mongo shell.

At first glance, it appears that one or more rogue servers connected to the mongod's but did not speak the correct wire protocol. Can you identify the servers by the ip addresses in the log fragments? For example, who is 10.220.40.25?

Generated at Thu Feb 08 03:13:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.