[SERVER-1483] Shutdown / rs.stepDown() of master creates random slave crashes Created: 25/Jul/10  Updated: 12/Jul/16  Resolved: 16/Aug/10

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.5.6
Fix Version/s: 1.6.1

Type: Bug Priority: Major - P3
Reporter: Alex Assignee: Dwight Merriman
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 10.04 64bit mongodb 1.5.6


Issue Links:
Duplicate
is duplicated by SERVER-1533 Replica set SECONDARY crash when term... Closed
Operating System: ALL
Participants:

 Description   

killing mongod on the master either by "command line kill" or "mongo rs.stepDown()" randomly crashes slaves. not all the time, but a lot of times.

slave log:

[conn561] Sun Jul 25 00:38:42 replSet TEMP RECEIVED ELECT MSG

{ replSetElect: 1, set: "<replset>", who: "<master host>", whoid: 2, cfgver: 5, round: ObjectId('4c4b879276b7cf142d74a41f') }

[conn561] Sun Jul 25 00:38:42 replSet info voting yea for 2
Sun Jul 25 00:38:43 Got signal: 11 (Segmentation fault).

Sun Jul 25 00:38:43 Backtrace:
0x7e5759 0x7f470d348af0 0x5392c0 0x66ff69 0x538df3 0x51eacb 0x7fa960 0x7f470de4c9ca 0x7f470d3fb6fd
/export/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x7e5759]
/lib/libc.so.6(+0x33af0) [0x7f470d348af0]
/export/mongodb/bin/mongod(_ZN5mongo4task6Server4sendEN5boost8functionIFvvEEE+0x50) [0x5392c0]
/export/mongodb/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x1159) [0x66ff69]
/export/mongodb/bin/mongod(_ZN5mongo4task4Task3runEv+0x33) [0x538df3]
/export/mongodb/bin/mongod(_ZN5mongo13BackgroundJob3thrEv+0x8b) [0x51eacb]
/export/mongodb/bin/mongod(thread_proxy+0x80) [0x7fa960]
/lib/libpthread.so.0(+0x69ca) [0x7f470de4c9ca]
/lib/libc.so.6(clone+0x6d) [0x7f470d3fb6fd]

Sun Jul 25 00:38:43 dbexit:

[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: going to close listening sockets...
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 going to close listening socket: 13
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 going to close listening socket: 14
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 going to close listening socket: 15
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 going to close listening socket: 16
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: going to flush oplog...
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: going to close sockets...
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: waiting for fs preallocator...
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: closing all files...
[initandlisten] Sun Jul 25 00:38:43 now exiting
Sun Jul 25 00:38:43 dbexit: ; exiting immediately

Sun Jul 25 00:38:43 Got signal: 6 (Aborted).



 Comments   
Comment by Alex [ 06/Aug/10 ]

Was not able to reproduce with latest 1.6 nightly. Thanks

Comment by Dwight Merriman [ 05/Aug/10 ]

thanks @david. i tried to reproduce but could not myself. There is a "speculative fix" (couldn't verify as couldn't reproduce) that is in 1.6.0 that was not in 1.5.8.

Comment by David Mytton [ 05/Aug/10 ]

I have been unable to reproduce the original reported problem on 1.5.8 (my case SERVER-1533) having tried over the last few days.

Comment by Dwight Merriman [ 04/Aug/10 ]

a couple things to possibly try if you can:

(1) try the above commit patch

(2) a newer version of boost with the old code

trying either one and reporting back would be helpful.

thanks

Comment by auto [ 04/Aug/10 ]

Author:

{'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}

Message: trying to fix SERVER-1483
http://github.com/mongodb/mongo/commit/6f6aa20e92072940d55308be4c8e4b2e70e579fc

Comment by auto [ 04/Aug/10 ]

Author:

{'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}

Message: trying to fix SERVER-1483
http://github.com/mongodb/mongo/commit/9c19c738a489e3dd50f2a3f7d6db6b2a49579b7d

Comment by Frank Steinborn [ 04/Aug/10 ]

Sure:

Wed Aug 4 18:04:22 MongoDB starting : pid=1867 port=27017 dbpath=/var/lib/mongodb 64-bit
Wed Aug 4 18:04:22 db version v1.5.8, pdfile version 4.5
Wed Aug 4 18:04:22 git version: nogitversion
Wed Aug 4 18:04:22 sys info: Linux rijkaard.dev.sipgate.net 2.6.26-2-amd64 #1 SMP Wed May 12 18:03:14 UTC 2010 i686 BOOST_LIB_VERSION=1_34_1

It's both built an run on Debian Lenny 64 bit.

Comment by Dwight Merriman [ 04/Aug/10 ]

i think i have a lead on it now. working on it.

Comment by Dwight Merriman [ 04/Aug/10 ]

@frank can you post the header from that log so i have the build version # details and the OS details tx.

Comment by Frank Steinborn [ 04/Aug/10 ]

I could not reproduce the crash of a secondary on rs.stepDown(), however, I see random crashes of the arbiter when the master does an rs.stepDown() during heavy write-load (e.g. mongorestore). Please let me know if I should open a new issue for that, but it sounds related...

This is with one primary, one secondary and one arbiter on version 1.5.8. Log of the arbiter after rs.stepDown() on primary:

[ReplSetHealthPollTask] Wed Aug 4 14:57:17 replSet info ramos.dev.sipgate.net:27017 is now down (or slow to respond)
Wed Aug 4 14:57:17 Got signal: 6 (Aborted).

Wed Aug 4 14:57:17 Backtrace:
0x739cb8 0x7faacba92f60 0x7faacba92ed5 0x7faacba943f3 0x7faacba8bdc9 0x50f548 0x60bd45 0x50fc8b 0x4f7880 0x7faaccf9e86d 0x7faacbdbafc7 0x7faacbb3064d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x528) [0x739cb8]
/lib/libc.so.6 [0x7faacba92f60]
/lib/libc.so.6(gsignal+0x35) [0x7faacba92ed5]
/lib/libc.so.6(abort+0x183) [0x7faacba943f3]
/lib/libc.so.6(__assert_fail+0xe9) [0x7faacba8bdc9]
/usr/bin/mongod(_ZN5mongo4task6Server4sendEN5boost8functionIFvvESaIvEEE+0x138) [0x50f548]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x8c5) [0x60bd45]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x7b) [0x50fc8b]
/usr/bin/mongod(_ZN5mongo13BackgroundJob3thrEv+0x80) [0x4f7880]
/usr/lib/libboost_thread-mt.so.1.35.0(thread_proxy+0x6d) [0x7faaccf9e86d]
/lib/libpthread.so.0 [0x7faacbdbafc7]
/lib/libc.so.6(clone+0x6d) [0x7faacbb3064d]

Wed Aug 4 14:57:17 dbexit:

[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: going to close listening sockets...
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 going to close listening socket: 11
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 going to close listening socket: 12
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 going to close listening socket: 13
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 going to close listening socket: 14
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: going to flush oplog...
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: going to close sockets...
[conn2] Wed Aug 4 14:57:17 SocketException: 9001 socket exception
[conn2] Wed Aug 4 14:57:17 end connection 127.0.0.1:53185
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: waiting for fs preallocator...
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: closing all files...
Wed Aug 4 14:57:17 closeAllFiles() finished

[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: removing fs lock...
Wed Aug 4 14:57:17 dbexit: really exiting now

Thanks,
Frank

Comment by Alex [ 04/Aug/10 ]

Will test on weekend

Comment by Dwight Merriman [ 02/Aug/10 ]

added some extra logging to help diagnose - please try again with the latest latest code when convenient.

Comment by auto [ 02/Aug/10 ]

Author:

{'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}

Message: defensive re SERVER-1483
http://github.com/mongodb/mongo/commit/d949b32d3ca9f492d486a9dc1b42752a987b690a

Comment by Alex [ 31/Jul/10 ]

Still exists in 1.5.7 now as stated above in repl/manager.cpp 55

another dump this time from 32bit node:

Sat Jul 31 22:00:41 [conn2] replSet info voting yea for 1
Sat Jul 31 22:00:41 [rs Manager] should never be called?
Sat Jul 31 22:00:41 [rs Manager] Assertion failure false db/repl/manager.cpp 55
Sat Jul 31 22:00:41 terminate() called, printing stack:

Sat Jul 31 22:00:41 Got signal: 6 (Aborted).

Sat Jul 31 22:00:41 Backtrace:
0x8477592 0xb780c400 0xb780c416 0xb7579851 0xb757cd42 0x8476207 0xb77c9465 0xb77c94a2 0x849087a 0xb76b2985 0xb761f13e
/export/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x3c2) [0x8477592]
[0xb780c400]
[0xb780c416]
/lib/tls/i686/nosegneg/libc.so.6(gsignal+0x51) [0xb7579851]
/lib/tls/i686/nosegneg/libc.so.6(abort+0x182) [0xb757cd42]
/export/mongodb/bin/mongod(_ZN5mongo11myterminateEv+0x137) [0x8476207]
/usr/lib/libstdc++.so.6(+0xbd465) [0xb77c9465]
/usr/lib/libstdc++.so.6(+0xbd4a2) [0xb77c94a2]
/export/mongodb/bin/mongod(thread_proxy+0x15a) [0x849087a]
/lib/tls/i686/nosegneg/libpthread.so.0(+0x5985) [0xb76b2985]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0xb761f13e]

Sat Jul 31 22:00:41 dbexit:

Sat Jul 31 22:00:41 [rs Manager] shutdown: going to close listening sockets...
Sat Jul 31 22:00:41 [rs Manager] going to close listening socket: 9
Sat Jul 31 22:00:41 [rs Manager] going to close listening socket: 10
Sat Jul 31 22:00:41 [rs Manager] going to close listening socket: 11
Sat Jul 31 22:00:41 [rs Manager] going to close listening socket: 12
Sat Jul 31 22:00:41 [rs Manager] shutdown: going to flush oplog...
Sat Jul 31 22:00:41 [rs Manager] shutdown: going to close sockets...
Sat Jul 31 22:00:41 [rs Manager] shutdown: waiting for fs preallocator...
Sat Jul 31 22:00:41 [rs Manager] shutdown: closing all files...
Sat Jul 31 22:00:41 closeAllFiles() finished

Sat Jul 31 22:00:41 [rs Manager] shutdown: removing fs lock...
Sat Jul 31 22:00:41 dbexit: really exiting now

Comment by Alex [ 29/Jul/10 ]

rs.stepDown() on master, using latest nightly build:

Thu Jul 29 18:54:19 [conn1] replSet info voting yea for 0
Thu Jul 29 18:54:20 [rs Manager] should never be called?
Thu Jul 29 18:54:20 [rs Manager] Assertion failure false db/repl/manager.cpp 55
Thu Jul 29 18:54:20 terminate() called, printing stack:

Thu Jul 29 18:54:20 Got signal: 6 (Aborted).

Thu Jul 29 18:54:20 Backtrace:
0x8000c9 0x7f8ece44aaf0 0x7f8ece44aa75 0x7f8ece44e5c0 0x7ff57f 0x7f8ececfed16 0x7f8ececfed43 0x7f8ececfee3e 0x53cdec 0x67d0d8 0x526d4e 0x815780 0x7f8ecef4e9ca 0x7f8ece4fd6fd
/export/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x8000c9]
/lib/libc.so.6(+0x33af0) [0x7f8ece44aaf0]
/lib/libc.so.6(gsignal+0x35) [0x7f8ece44aa75]
/lib/libc.so.6(abort+0x180) [0x7f8ece44e5c0]
/export/mongodb/bin/mongod(_ZN5mongo11myterminateEv+0x10f) [0x7ff57f]
/usr/lib/libstdc++.so.6(+0xcad16) [0x7f8ececfed16]
/usr/lib/libstdc++.so.6(+0xcad43) [0x7f8ececfed43]
/usr/lib/libstdc++.so.6(+0xcae3e) [0x7f8ececfee3e]
/export/mongodb/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x38c) [0x53cdec]
/export/mongodb/bin/mongod(_ZN5mongo7ManagerD0Ev+0x68) [0x67d0d8]
/export/mongodb/bin/mongod(_ZN5mongo13BackgroundJob3thrEv+0xae) [0x526d4e]
/export/mongodb/bin/mongod(thread_proxy+0x80) [0x815780]
/lib/libpthread.so.0(+0x69ca) [0x7f8ecef4e9ca]
/lib/libc.so.6(clone+0x6d) [0x7f8ece4fd6fd]

Thu Jul 29 18:54:20 dbexit:

Thu Jul 29 18:54:20 [rs Manager] shutdown: going to close listening sockets...
Thu Jul 29 18:54:20 [rs Manager] going to close listening socket: 13
Thu Jul 29 18:54:20 [rs Manager] going to close listening socket: 14
Thu Jul 29 18:54:20 [rs Manager] going to close listening socket: 15
Thu Jul 29 18:54:20 [rs Manager] going to close listening socket: 16
Thu Jul 29 18:54:20 [rs Manager] shutdown: going to flush oplog...
Thu Jul 29 18:54:20 [rs Manager] shutdown: going to close sockets...
Thu Jul 29 18:54:20 [rs Manager] shutdown: waiting for fs preallocator...
Thu Jul 29 18:54:20 [rs Manager] shutdown: closing all files...
Thu Jul 29 18:54:20 [initandlisten] now exiting
Thu Jul 29 18:54:20 dbexit: ; exiting immediately

Thu Jul 29 18:54:20 Got signal: 6 (Aborted).

Thu Jul 29 18:54:20 Backtrace:
0x8000c9 0x7f8ece44aaf0 0x7f8ece44aa75 0x7f8ece44e5c0 0x7f8ece443941 0x52909f 0x528282 0x6e8231 0x6e8781 0x80020b 0x7f8ece44aaf0 0x7f8ece44aa75 0x7f8ece44e5c0 0x7ff57f 0x7f8ececfed16 0x7f8ececfed43 0x7f8ececfee3e 0x53cdec 0x67d0d8 0x526d4e
/export/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x8000c9]
/lib/libc.so.6(+0x33af0) [0x7f8ece44aaf0]
/lib/libc.so.6(gsignal+0x35) [0x7f8ece44aa75]
/lib/libc.so.6(abort+0x180) [0x7f8ece44e5c0]
/lib/libc.so.6(__assert_fail+0xf1) [0x7f8ece443941]
/export/mongodb/bin/mongod(_ZN5mongo6rwlockD1Ev+0xef) [0x52909f]
/export/mongodb/bin/mongod(_ZN5mongo9MongoFile13closeAllFilesERSt18basic_stringstreamIcSt11char_traitsIcESaIcEE+0x192) [0x528282]
/export/mongodb/bin/mongod(_ZN5mongo8shutdownEv+0x501) [0x6e8231]
/export/mongodb/bin/mongod(_ZN5mongo6dbexitENS_8ExitCodeEPKc+0x161) [0x6e8781]
/export/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x4db) [0x80020b]
/lib/libc.so.6(+0x33af0) [0x7f8ece44aaf0]
/lib/libc.so.6(gsignal+0x35) [0x7f8ece44aa75]
/lib/libc.so.6(abort+0x180) [0x7f8ece44e5c0]
/export/mongodb/bin/mongod(_ZN5mongo11myterminateEv+0x10f) [0x7ff57f]
/usr/lib/libstdc++.so.6(+0xcad16) [0x7f8ececfed16]
/usr/lib/libstdc++.so.6(+0xcad43) [0x7f8ececfed43]
/usr/lib/libstdc++.so.6(+0xcae3e) [0x7f8ececfee3e]
/export/mongodb/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x38c) [0x53cdec]
/export/mongodb/bin/mongod(_ZN5mongo7ManagerD0Ev+0x68) [0x67d0d8]
/export/mongodb/bin/mongod(_ZN5mongo13BackgroundJob3thrEv+0xae) [0x526d4e]

Comment by Eliot Horowitz (Inactive) [ 29/Jul/10 ]

If you reproduce - please re-open

Comment by Alex [ 28/Jul/10 ]

http://github.com/mongodb/mongo/commit/26828e12a959a3d293d11c342f88c2ea90d52000

db/repl/rs.h
...
log() << "should never be called?" << rsLog;
...

The removed "manager should never be called' was visible on the command line in those cases.

I'll update the cluster to nightly tomorrow and check if the problem persists.

Comment by Kyle Banker [ 27/Jul/10 ]

See previous comment. Tested in jstests and unable to reproduce.

Comment by auto [ 27/Jul/10 ]

Author:

{'login': 'banker', 'name': 'Kyle Banker', 'email': 'kylebanker@gmail.com'}

Message: rs test updates SERVER-1483
http://github.com/mongodb/mongo/commit/79d5c94fd79aa2ca1eb2d50f2cc3df7a3fe741a5

Comment by Kyle Banker [ 27/Jul/10 ]

I have not been able to reproduce this case. There are two tests that kill master.

jstests/replsets/replset1.js kills the master mongod from the command line.

jstests/replsets/replset3.js kills the master mongod using the replicaSetStepDown command.

Neither seems to crash the slaves.

Comment by Eliot Horowitz (Inactive) [ 25/Jul/10 ]

kyle - can you add a test for this.
when you have a test that fiels, comment out failing part, commit and assign back to me

Generated at Thu Feb 08 02:57:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.