[SERVER-1483] Shutdown / rs.stepDown() of master creates random slave crashes Created: 25/Jul/10 Updated: 12/Jul/16 Resolved: 16/Aug/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 1.5.6 |
| Fix Version/s: | 1.6.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alex | Assignee: | Dwight Merriman |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
ubuntu 10.04 64bit mongodb 1.5.6 |
| Operating System: | ALL | ||||||||
| Description |
|
Killing mongod on the master, either by "command line kill" or by "mongo rs.stepDown()", randomly crashes slaves. It does not happen every time, but it happens often.
Slave log:
[conn561] Sun Jul 25 00:38:42 replSet TEMP RECEIVED ELECT MSG { replSetElect: 1, set: "<replset>", who: "<master host>", whoid: 2, cfgver: 5, round: ObjectId('4c4b879276b7cf142d74a41f') }
[conn561] Sun Jul 25 00:38:42 replSet info voting yea for 2
Sun Jul 25 00:38:43 Backtrace:
Sun Jul 25 00:38:43 dbexit:
[ReplSetHealthPollTask] Sun Jul 25 00:38:43 shutdown: going to close listening sockets...
Sun Jul 25 00:38:43 Got signal: 6 (Aborted). |
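The crash signature in the slave log (a "voting yea" line followed shortly by "Got signal: 6 (Aborted)") can be searched for mechanically when checking many slave logs after a stepDown. The following is only a minimal sketch: the helper name `find_aborts` and the `TIMESTAMP` pattern are hypothetical, and the line format is assumed from the excerpts quoted in this ticket, not from any documented mongod log specification.

```python
import re

# Assumption: timestamps look like the ones quoted in this ticket,
# e.g. "Sun Jul 25 00:38:42" or "Wed Aug 4 14:57:17".
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")


def find_aborts(log_text):
    """Return one (timestamp, last_vote_line) pair per abort line.

    last_vote_line is the most recent "voting yea" line seen before the
    abort, or None if no vote was logged before it.
    """
    results = []
    last_vote = None
    for line in log_text.splitlines():
        if "replSet info voting yea" in line:
            last_vote = line.strip()
        elif "Got signal: 6 (Aborted)" in line:
            match = TIMESTAMP.search(line)
            results.append((match.group(0) if match else None, last_vote))
    return results


# Sample lines taken from a log excerpt posted later in this ticket.
sample = (
    "Thu Jul 29 18:54:19 [conn1] replSet info voting yea for 0\n"
    "Thu Jul 29 18:54:20 Got signal: 6 (Aborted).\n"
    "Thu Jul 29 18:54:20 Backtrace:\n"
)

for timestamp, vote in find_aborts(sample):
    print(timestamp, "| last vote:", vote)
```

Pairing each abort with the preceding vote line makes it easier to see whether a crash coincides with an election, which is the pattern the reports in this ticket describe.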
| Comments |
| Comment by Alex [ 06/Aug/10 ] |
|
Was not able to reproduce with latest 1.6 nightly. Thanks |
| Comment by Dwight Merriman [ 05/Aug/10 ] |
|
thanks @david. i tried to reproduce but could not myself. There is a "speculative fix" (couldn't verify as couldn't reproduce) that is in 1.6.0 that was not in 1.5.8. |
| Comment by David Mytton [ 05/Aug/10 ] |
|
I have been unable to reproduce the original reported problem on 1.5.8 (my case |
| Comment by Dwight Merriman [ 04/Aug/10 ] |
|
a couple of things to possibly try if you can: (1) try the above commit patch; (2) try a newer version of boost with the old code. trying either one and reporting back would be helpful. thanks |
| Comment by auto [ 04/Aug/10 ] |
|
Author: {'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}Message: trying to fix |
| Comment by auto [ 04/Aug/10 ] |
|
Author: {'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}Message: trying to fix |
| Comment by Frank Steinborn [ 04/Aug/10 ] |
|
Sure:
Wed Aug 4 18:04:22 MongoDB starting : pid=1867 port=27017 dbpath=/var/lib/mongodb 64-bit
It's both built and run on Debian Lenny 64 bit. |
| Comment by Dwight Merriman [ 04/Aug/10 ] |
|
i think i have a lead on it now. working on it. |
| Comment by Dwight Merriman [ 04/Aug/10 ] |
|
@frank can you post the header from that log so i have the build version # details and the OS details? tx. |
| Comment by Frank Steinborn [ 04/Aug/10 ] |
|
I could not reproduce the crash of a secondary on rs.stepDown(); however, I see random crashes of the arbiter when the master does an rs.stepDown() during heavy write load (e.g. mongorestore). Please let me know if I should open a new issue for that, but it sounds related... This is with one primary, one secondary and one arbiter on version 1.5.8.
Log of the arbiter after rs.stepDown() on primary:
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 replSet info ramos.dev.sipgate.net:27017 is now down (or slow to respond)
Wed Aug 4 14:57:17 Backtrace:
Wed Aug 4 14:57:17 dbexit:
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: going to close listening sockets...
[ReplSetHealthPollTask] Wed Aug 4 14:57:17 shutdown: removing fs lock...
Thanks, |
| Comment by Alex [ 04/Aug/10 ] |
|
Will test on weekend |
| Comment by Dwight Merriman [ 02/Aug/10 ] |
|
added some extra logging to help diagnose - please try again with the latest code when convenient. |
| Comment by auto [ 02/Aug/10 ] |
|
Author: {'login': 'dwight', 'name': 'Dwight', 'email': 'dwight@10gen.com'}Message: defensive re |
| Comment by Alex [ 31/Jul/10 ] |
|
Still exists in 1.5.7 now, as stated above in repl/manager.cpp 55. Another dump, this time from a 32-bit node:
Sat Jul 31 22:00:41 [conn2] replSet info voting yea for 1
Sat Jul 31 22:00:41 Got signal: 6 (Aborted).
Sat Jul 31 22:00:41 Backtrace:
Sat Jul 31 22:00:41 dbexit:
Sat Jul 31 22:00:41 [rs Manager] shutdown: going to close listening sockets...
Sat Jul 31 22:00:41 [rs Manager] shutdown: removing fs lock... |
| Comment by Alex [ 29/Jul/10 ] |
|
rs.stepDown() on master, using latest nightly build:
Thu Jul 29 18:54:19 [conn1] replSet info voting yea for 0
Thu Jul 29 18:54:20 Got signal: 6 (Aborted).
Thu Jul 29 18:54:20 Backtrace:
Thu Jul 29 18:54:20 dbexit:
Thu Jul 29 18:54:20 [rs Manager] shutdown: going to close listening sockets...
Thu Jul 29 18:54:20 Got signal: 6 (Aborted).
Thu Jul 29 18:54:20 Backtrace: |
| Comment by Eliot Horowitz (Inactive) [ 29/Jul/10 ] |
|
If you reproduce - please re-open |
| Comment by Alex [ 28/Jul/10 ] |
|
http://github.com/mongodb/mongo/commit/26828e12a959a3d293d11c342f88c2ea90d52000 db/repl/rs.h
The removed "manager should never be called" was visible on the command line in those cases. I'll update the cluster to nightly tomorrow and check if the problem persists. |
| Comment by Kyle Banker [ 27/Jul/10 ] |
|
See previous comment. Tested in jstests and unable to reproduce. |
| Comment by auto [ 27/Jul/10 ] |
|
Author: {'login': 'banker', 'name': 'Kyle Banker', 'email': 'kylebanker@gmail.com'}Message: rs test updates |
| Comment by Kyle Banker [ 27/Jul/10 ] |
|
I have not been able to reproduce this case. There are two tests that kill master. jstests/replsets/replset1.js kills the master mongod from the command line. jstests/replsets/replset3.js kills the master mongod using the replicaSetStepDown command. Neither seems to crash the slaves. |
| Comment by Eliot Horowitz (Inactive) [ 25/Jul/10 ] |
|
kyle - can you add a test for this. |