[SERVER-2710] Running replSetReconfig while writing to a collection causes secondaries and arbiters to segfault. Created: 08/Mar/11  Updated: 12/Jul/16  Resolved: 03/May/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.6.4, 1.6.5, 1.8.0-rc1
Fix Version/s: 1.9.1

Type: Bug
Priority: Major - P3
Reporter: Bernie Hackett
Assignee: Kristina Chodorow (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux x86_64


Issue Links:
Depends
Duplicate
is duplicated by SERVER-3314 Segmentation Fault on Re-configuring ... Closed
is duplicated by SERVER-4091 rs.reconfig(...) on 1.8.4_rc0 causes ... Closed
is duplicated by SERVER-3381 Crash after enabling slave when balan... Closed
Related
is related to SERVER-3032 mongod crashed in ReplSetImpl summari... Closed
Operating System: ALL
Participants:

 Description   

In short:

1. Create a replica set with 6 nodes (1 primary, 3 secondaries and 2 arbiters).
2. Insert a lot of trivial documents.
3. While the inserts are running, run replSetReconfig repeatedly.
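
A minimal sketch of steps 2 and 3, assuming a current pymongo driver and an already-running replica set (step 1 is not covered). The namespace `test.server2710`, the function names, and the loop counts are hypothetical and were not part of the original report; the config is read from local.system.replset, the collection named in the server log, and re-submitted with only its version incremented.

```python
import copy
import threading


def next_config(current):
    """Return a deep copy of a replica set config with its version bumped by one."""
    new_cfg = copy.deepcopy(current)
    new_cfg["version"] = current["version"] + 1
    return new_cfg


def reproduce(uri, reconfigs=50, docs=1_000_000):
    """Insert trivial documents in one thread while reconfiguring in another."""
    # pymongo is imported lazily so next_config() is usable without the driver.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    stop = threading.Event()

    def insert_loop():
        # Step 2: insert many trivial documents until told to stop.
        coll = MongoClient(uri).test.server2710  # hypothetical namespace
        for i in range(docs):
            if stop.is_set():
                break
            try:
                coll.insert_one({"i": i})
            except PyMongoError:
                pass  # connections may drop during a reconfig; keep inserting

    writer = threading.Thread(target=insert_loop, daemon=True)
    writer.start()

    # Step 3: reconfigure repeatedly while the inserts are in flight.
    admin_client = MongoClient(uri)
    for _ in range(reconfigs):
        try:
            cfg = admin_client.local["system.replset"].find_one()
            admin_client.admin.command("replSetReconfig", next_config(cfg))
        except PyMongoError:
            pass  # the command's own socket may be closed by the reconfig
    stop.set()
    writer.join()
```

Calling `reproduce("mongodb://localhost:27017/?replicaSet=rs0")` would drive the workload; the URI and set name are placeholders. Driver errors are swallowed on purpose so that dropped connections do not end the run early.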

Expected outcome:
No issues.

Actual outcome:
1.6.x: One secondary and one arbiter crash with the following backtrace:

Tue Mar 8 13:05:08 [initandlisten] connection accepted from 127.0.0.1:50215 #6
Tue Mar 8 13:05:08 Got signal: 11 (Segmentation fault).

Tue Mar 8 13:05:08 Backtrace:
0x824629 0x7f47172a41f0 0x66c10b 0x67020a 0x6682e2 0x797117 0x798538 0x5fb7e5 0x60029f 0x7074ba 0x70aaf6 0x82691b 0x83a4b0 0x7f4717d7e914 0x7f47173437dd
/home/behackett/mongo/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x824629]
/lib/libc.so.6(+0x321f0) [0x7f47172a41f0]
/home/behackett/mongo/bin/mongod(_ZN5mongo11ReplSetImpl17_fillIsMasterHostEPKNS_6MemberERSt6vectorISsSaISsEES7_S7_+0x2b) [0x66c10b]
/home/behackett/mongo/bin/mongod(_ZN5mongo11ReplSetImpl13_fillIsMasterERNS_14BSONObjBuilderE+0x27a) [0x67020a]
/home/behackett/mongo/bin/mongod(_ZN5mongo11CmdIsMaster3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x52) [0x6682e2]
/home/behackett/mongo/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x597) [0x797117]
/home/behackett/mongo/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x798) [0x798538]
/home/behackett/mongo/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x35) [0x5fb7e5]
/home/behackett/mongo/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1bbf) [0x60029f]
/home/behackett/mongo/bin/mongod() [0x7074ba]
/home/behackett/mongo/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x14d6) [0x70aaf6]
/home/behackett/mongo/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x30b) [0x82691b]
/home/behackett/mongo/bin/mongod(thread_proxy+0x80) [0x83a4b0]
/lib/libpthread.so.0(+0x6914) [0x7f4717d7e914]
/lib/libc.so.6(clone+0x6d) [0x7f47173437dd]

Tue Mar 8 13:05:08 dbexit:

1.8.0-rc1: Backtrace in the log of one secondary:

Tue Mar 8 13:19:46 [initandlisten] connection accepted from 127.0.0.1:43559 #9
Tue Mar 8 13:19:57 [replica set sync] DBClientCursor::init call() failed
Tue Mar 8 13:19:57 [replica set sync] Assertion failure r.haveCursor() db/repl/rs_sync.cpp 288
0x530b21 0x541131 0x69ffe6 0x6a0c00 0x6a10b1 0x7f41d9fb2c57 0x7f41da6db914 0x7f41d8d127dd
mongod(_ZN5mongo12sayDbContextEPKc+0xb1) [0x530b21]
mongod(_ZN5mongo8assertedEPKcS1_j+0xc1) [0x541131]
mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0xd96) [0x69ffe6]
mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x80) [0x6a0c00]
mongod(_ZN5mongo15startSyncThreadEv+0x1d1) [0x6a10b1]
/usr/lib/libboost_thread-mt-1_42.so.1.42.0(thread_proxy+0x77) [0x7f41d9fb2c57]
/lib/libpthread.so.0(+0x6914) [0x7f41da6db914]
/lib/libc.so.6(clone+0x6d) [0x7f41d8d127dd]
Tue Mar 8 13:19:57 [replica set sync] replSet syncThread: 0 assertion db/repl/rs_sync.cpp:288
Tue Mar 8 13:19:57 [conn9] end connection 127.0.0.1:43559
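
The frames in the backtraces above are C++ mangled names. The following is a minimal sketch for pulling them out of log lines and demangling them, assuming the binutils `c++filt` tool is available on the PATH; the function names and the regular expression are illustrative and not part of the original report.

```python
import re
import shutil
import subprocess

# Matches frames such as "mongod(_ZN5mongo10abruptQuitEi+0x399) [0x824629]".
FRAME = re.compile(r"\((_Z\w+)\+0x[0-9a-fA-F]+\)")


def mangled_symbols(log_lines):
    """Collect the mangled C++ names found in backtrace lines, in order."""
    return [m.group(1) for line in log_lines for m in FRAME.finditer(line)]


def demangle(symbols):
    """Demangle with c++filt if installed; otherwise return the input unchanged."""
    tool = shutil.which("c++filt")
    if tool is None or not symbols:
        return list(symbols)
    result = subprocess.run(
        [tool],
        input="\n".join(symbols) + "\n",  # c++filt reads names from stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Frames with no symbol (for example `mongod() [0x7074ba]`) and plain C symbols such as `thread_proxy` are skipped, since only names beginning with `_Z` are matched.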



 Comments   
Comment by Kristina Chodorow (Inactive) [ 24/Jun/11 ]

Yes.

Comment by Dwight Merriman [ 24/Jun/11 ]

Kristina will know more definitively, but I believe yes.

Comment by Mike K [ 23/Jun/11 ]

Would this also fix the primary segfaulting when calling replSetReconfig? Have seen this happen with two different replica sets when removing a node.

Comment by auto [ 02/May/11 ]

Author: Kristina (kchodorow) <kristina@10gen.com>

Message: never set self to 0 SERVER-2710
Branch: master
https://github.com/mongodb/mongo/commit/be8e2f261d50a410749f2934431d451eecdb47b3

Comment by Kristina Chodorow (Inactive) [ 08/Mar/11 ]

The closing of connections is intentional and expected.

Comment by Bernie Hackett [ 08/Mar/11 ]

I should also note that in 1.8.0-rc1 the primary closes all connections when it steps down and then becomes primary again:

Tue Mar 8 13:30:41 [conn32] replSet replSetReconfig config object parses ok, 6 members specified
Tue Mar 8 13:30:41 [conn32] replSet replSetReconfig [2]
Tue Mar 8 13:30:41 [conn32] replSet info saving a newer config version to local.system.replset
Tue Mar 8 13:30:41 [conn32] replSet relinquishing primary state
Tue Mar 8 13:30:41 [conn32] replSet SECONDARY
Tue Mar 8 13:30:41 [conn32] replSet closing client sockets after reqlinquishing primary
Tue Mar 8 13:30:41 [conn32] replSet PRIMARY
Tue Mar 8 13:30:41 [conn32] replSet replSetReconfig new config saved locally
Tue Mar 8 13:30:41 [conn32] SocketException in connThread, closing client connection
Tue Mar 8 13:30:41 [conn30] SocketException in connThread, closing client connection
Tue Mar 8 13:30:41 [conn31] SocketException in connThread, closing client connection
Tue Mar 8 13:30:41 [conn29] SocketException in connThread, closing client connection
Tue Mar 8 13:30:41 [ReplSetHealthPollTask] replSet info behackett-dt:31018 is down (or slow to respond): socket exception
Tue Mar 8 13:30:41 [ReplSetHealthPollTask] replSet info behackett-dt:31021 is down (or slow to respond): socket exception
Tue Mar 8 13:30:41 [ReplSetHealthPollTask] replSet info behackett-dt:31022 is down (or slow to respond): socket exception
Tue Mar 8 13:30:41 [ReplSetHealthPollTask] replSet info behackett-dt:31019 is down (or slow to respond): socket exception
Tue Mar 8 13:30:41 [ReplSetHealthPollTask] replSet info behackett-dt:31020 is down (or slow to respond): socket exception
