Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-12098

node with votes:0 can get involved in election and result in abnormal state

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
    • Operating System:
      ALL

      Description

      In some situations (possible racy) a node with no votes gets to somehow participate in election which causes it to get to places in the code where it shouldn't be.

      Including logs showing this happening during ad-hoc testing, trying to determine sequence of events to create a reproducible test.

      // rs.conf()
      (2.4.5) > rs.conf()
      {
      	"_id" : "test",
      	"version" : 4,
      	"members" : [
      		{
      			"_id" : 0,
      			"host" : "asyasmacbook.local:40001"
      		},
      		{
      			"_id" : 1,
      			"host" : "asyasmacbook.local:40002",
      			"votes" : 0,
      			"priority" : 0,
      			"slaveDelay" : 600,
      			"hidden" : true
      		}
      	]
      }
       
      // logLevel2 logging from votes:0 node
      Sat Dec 14 14:36:11.152 [repl prefetch worker] opening db:  test
      Sat Dec 14 14:36:11.154 [rsSync] replication batch size is 1
      Sat Dec 14 14:36:11.156 [rsSync] replication batch size is 99
      Sat Dec 14 14:36:12.151 [rsSyncNotifier] replset setting oplog notifier to asyasmacbook.local:40001
      Sat Dec 14 14:36:12.151 BackgroundJob starting: ConnectBG
      Sat Dec 14 14:36:12.152 [rsSyncNotifier] replHandshake res not: 1 res: { ok: 1.0 }
      Sat Dec 14 14:36:12.152 [rsSyncNotifier] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1387059251000|1 } })
      Sat Dec 14 14:36:12.461 [conn1] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" } ntoreturn:1 keyUpdates:0  reslen:196 0ms
      Sat Dec 14 14:36:14.461 [conn1] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" } ntoreturn:1 keyUpdates:0  reslen:196 0ms
      Sat Dec 14 14:36:14.462 [conn1] runQuery called admin.$cmd { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] run command admin.$cmd { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] command: { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] command admin.$cmd command: { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 } ntoreturn:1 keyUpdates:0  reslen:70 0ms
      Sat Dec 14 14:36:14.462 [conn1] runQuery called admin.$cmd { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] run command admin.$cmd { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] command: { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] replSet received elect msg { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] replSet attempting to relinquish
      Sat Dec 14 14:36:14.462 [conn1] replSet info voting yea for asyasmacbook.local:40001 (0)
      Sat Dec 14 14:36:14.462 [conn1] command admin.$cmd command: { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') } ntoreturn:1 keyUpdates:0 locks(micros) W:16 reslen:66 0ms
       
      // logLevel2 logs from other node:
      Sat Dec 14 14:36:11.151 [initandlisten] connection accepted from 127.0.0.1:53165 #3 (2 connections now open)
      Sat Dec 14 14:36:11.151 [conn3] runQuery called local.oplog.rs {}
      Sat Dec 14 14:36:11.152 [conn3] query local.oplog.rs ntoreturn:1 ntoskip:0 nscanned:1 keyUpdates:0 locks(micros) r:55 nreturned:1 reslen:106 0ms
      Sat Dec 14 14:36:11.152 [conn3] runQuery called local.oplog.rs { ts: { $gte: Timestamp 1387058405000|1 } }
      Sat Dec 14 14:36:11.152 [conn3] query local.oplog.rs query: { ts: { $gte: Timestamp 1387058405000|1 } } cursorid:35061558897211 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 locks(micros) r:272 nreturned:101 reslen:8706 0ms
      Sat Dec 14 14:36:12.151 [initandlisten] connection accepted from 127.0.0.1:53166 #4 (3 connections now open)
      Sat Dec 14 14:36:12.152 [conn4] runQuery called admin.$cmd { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } }
      Sat Dec 14 14:36:12.152 [conn4] run command admin.$cmd { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } }
      Sat Dec 14 14:36:12.152 [conn4] command admin.$cmd command: { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } } ntoreturn:1 keyUpdates:0  reslen:37 0ms
      Sat Dec 14 14:36:12.152 [conn4] runQuery called local.oplog.rs { ts: { $gte: Timestamp 1387059251000|1 } }
      Sat Dec 14 14:36:12.152 [conn4] query local.oplog.rs query: { ts: { $gte: Timestamp 1387059251000|1 } } cursorid:39102337507168 ntoreturn:0 ntoskip:0 nscanned:2 keyUpdates:0 locks(micros) r:108 nreturned:1 reslen:37 0ms
      Sat Dec 14 14:36:12.152 BackgroundJob starting: SlaveTracking
      Sat Dec 14 14:36:12.152 [conn4] percolating 52acd833:1 from { _id: ObjectId('52accb5e22048edc1a5ef6e7') }
      Sat Dec 14 14:36:12.152 [rsGhostSync] replica set ghost target no good
      Sat Dec 14 14:36:12.161 [conn2] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" } ntoreturn:1 keyUpdates:0  reslen:120 0ms
      Sat Dec 14 14:36:13.153 [slaveTracking]   running multiple plans
      Sat Dec 14 14:36:13.153 [slaveTracking] User Assertion: 11000:E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('52accb5e22048edc1a5ef6e7') }
      Sat Dec 14 14:36:13.153 [slaveTracking] update local.slaves query: { _id: ObjectId('52accb5e22048edc1a5ef6e7'), config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true }, ns: "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1387059251000|1 } } nscanned:2 fastmodinsert:1 keyUpdates:0 exception: E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('52accb5e22048edc1a5ef6e7') } code:11000 locks(micros) w:587 0ms
      Sat Dec 14 14:36:14.162 [conn2] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" } ntoreturn:1 keyUpdates:0  reslen:120 0ms
      Sat Dec 14 14:36:14.462 BackgroundJob starting: MultiCommandJob
      Sat Dec 14 14:36:14.462 [rsMgr] replSet dev we are freshest of up nodes, nok:1 nTies:1
      Sat Dec 14 14:36:14.462 [rsMgr] replSet info electSelf 0
      Sat Dec 14 14:36:14.462 BackgroundJob starting: MultiCommandJob
      Sat Dec 14 14:36:14.462 [rsMgr] replSet election succeeded, assuming primary role
      Sat Dec 14 14:36:14.462 [rsMgr] replSet assuming primary
      Sat Dec 14 14:36:14.462 [rsMgr] replSet waiting for replication to finish before becoming primary
      Sat Dec 14 14:36:14.465 [rsMgr] runQuery called local.oplog.rs { query: {}, orderby: { $natural: -1 } }
      Sat Dec 14 14:36:14.465 [rsMgr] query local.oplog.rs query: { query: {}, orderby: { $natural: -1 } } ntoreturn:1 ntoskip:0 nscanned:1 keyUpdates:0  nreturned:1 reslen:106 0ms
      Sat Dec 14 14:36:14.465 [rsMgr] replSet PRIMARY
      Sat Dec 14 14:36:15.155 [conn3] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1387058405000|1 } } cursorid:35061558897211 ntoreturn:0 keyUpdates:0 locks(micros) r:3260 nreturned:0 reslen:20 4002ms

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              matt.dannenberg Matt Dannenberg
              Reporter:
              asya Asya Kamsky
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: