Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-12098

node with votes:0 can get involved in election and result in abnormal state

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Critical - P2 Critical - P2
    • None
    • Affects Version/s: 2.4.5
    • Component/s: Replication
    • Labels:
    • ALL

      In some situations (possible racy) a node with no votes gets to somehow participate in election which causes it to get to places in the code where it shouldn't be.

      Including logs showing this happening during ad-hoc testing, trying to determine sequence of events to create a reproducible test.

      // rs.conf()
      (2.4.5) > rs.conf()
      {
      	"_id" : "test",
      	"version" : 4,
      	"members" : [
      		{
      			"_id" : 0,
      			"host" : "asyasmacbook.local:40001"
      		},
      		{
      			"_id" : 1,
      			"host" : "asyasmacbook.local:40002",
      			"votes" : 0,
      			"priority" : 0,
      			"slaveDelay" : 600,
      			"hidden" : true
      		}
      	]
      }
      
      // logLevel2 logging from votes:0 node
      Sat Dec 14 14:36:11.152 [repl prefetch worker] opening db:  test
      Sat Dec 14 14:36:11.154 [rsSync] replication batch size is 1
      Sat Dec 14 14:36:11.156 [rsSync] replication batch size is 99
      Sat Dec 14 14:36:12.151 [rsSyncNotifier] replset setting oplog notifier to asyasmacbook.local:40001
      Sat Dec 14 14:36:12.151 BackgroundJob starting: ConnectBG
      Sat Dec 14 14:36:12.152 [rsSyncNotifier] replHandshake res not: 1 res: { ok: 1.0 }
      Sat Dec 14 14:36:12.152 [rsSyncNotifier] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1387059251000|1 } })
      Sat Dec 14 14:36:12.461 [conn1] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:12.461 [conn1] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" } ntoreturn:1 keyUpdates:0  reslen:196 0ms
      Sat Dec 14 14:36:14.461 [conn1] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" }
      Sat Dec 14 14:36:14.461 [conn1] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40001" } ntoreturn:1 keyUpdates:0  reslen:196 0ms
      Sat Dec 14 14:36:14.462 [conn1] runQuery called admin.$cmd { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] run command admin.$cmd { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] command: { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 }
      Sat Dec 14 14:36:14.462 [conn1] command admin.$cmd command: { replSetFresh: 1, set: "test", opTime: new Date(5957374120659255297), who: "asyasmacbook.local:40001", cfgver: 4, id: 0 } ntoreturn:1 keyUpdates:0  reslen:70 0ms
      Sat Dec 14 14:36:14.462 [conn1] runQuery called admin.$cmd { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] run command admin.$cmd { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] command: { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] replSet received elect msg { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') }
      Sat Dec 14 14:36:14.462 [conn1] replSet attempting to relinquish
      Sat Dec 14 14:36:14.462 [conn1] replSet info voting yea for asyasmacbook.local:40001 (0)
      Sat Dec 14 14:36:14.462 [conn1] command admin.$cmd command: { replSetElect: 1, set: "test", who: "asyasmacbook.local:40001", whoid: 0, cfgver: 4, round: ObjectId('52acdd5e474a3f6203328665') } ntoreturn:1 keyUpdates:0 locks(micros) W:16 reslen:66 0ms
      
      // logLevel2 logs from other node:
      Sat Dec 14 14:36:11.151 [initandlisten] connection accepted from 127.0.0.1:53165 #3 (2 connections now open)
      Sat Dec 14 14:36:11.151 [conn3] runQuery called local.oplog.rs {}
      Sat Dec 14 14:36:11.152 [conn3] query local.oplog.rs ntoreturn:1 ntoskip:0 nscanned:1 keyUpdates:0 locks(micros) r:55 nreturned:1 reslen:106 0ms
      Sat Dec 14 14:36:11.152 [conn3] runQuery called local.oplog.rs { ts: { $gte: Timestamp 1387058405000|1 } }
      Sat Dec 14 14:36:11.152 [conn3] query local.oplog.rs query: { ts: { $gte: Timestamp 1387058405000|1 } } cursorid:35061558897211 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 locks(micros) r:272 nreturned:101 reslen:8706 0ms
      Sat Dec 14 14:36:12.151 [initandlisten] connection accepted from 127.0.0.1:53166 #4 (3 connections now open)
      Sat Dec 14 14:36:12.152 [conn4] runQuery called admin.$cmd { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } }
      Sat Dec 14 14:36:12.152 [conn4] run command admin.$cmd { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } }
      Sat Dec 14 14:36:12.152 [conn4] command admin.$cmd command: { handshake: ObjectId('52accb5e22048edc1a5ef6e7'), member: 1, config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true } } ntoreturn:1 keyUpdates:0  reslen:37 0ms
      Sat Dec 14 14:36:12.152 [conn4] runQuery called local.oplog.rs { ts: { $gte: Timestamp 1387059251000|1 } }
      Sat Dec 14 14:36:12.152 [conn4] query local.oplog.rs query: { ts: { $gte: Timestamp 1387059251000|1 } } cursorid:39102337507168 ntoreturn:0 ntoskip:0 nscanned:2 keyUpdates:0 locks(micros) r:108 nreturned:1 reslen:37 0ms
      Sat Dec 14 14:36:12.152 BackgroundJob starting: SlaveTracking
      Sat Dec 14 14:36:12.152 [conn4] percolating 52acd833:1 from { _id: ObjectId('52accb5e22048edc1a5ef6e7') }
      Sat Dec 14 14:36:12.152 [rsGhostSync] replica set ghost target no good
      Sat Dec 14 14:36:12.161 [conn2] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:12.161 [conn2] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" } ntoreturn:1 keyUpdates:0  reslen:120 0ms
      Sat Dec 14 14:36:13.153 [slaveTracking]   running multiple plans
      Sat Dec 14 14:36:13.153 [slaveTracking] User Assertion: 11000:E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('52accb5e22048edc1a5ef6e7') }
      Sat Dec 14 14:36:13.153 [slaveTracking] update local.slaves query: { _id: ObjectId('52accb5e22048edc1a5ef6e7'), config: { _id: 1, host: "asyasmacbook.local:40002", votes: 0, priority: 0.0, slaveDelay: 600, hidden: true }, ns: "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1387059251000|1 } } nscanned:2 fastmodinsert:1 keyUpdates:0 exception: E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('52accb5e22048edc1a5ef6e7') } code:11000 locks(micros) w:587 0ms
      Sat Dec 14 14:36:14.162 [conn2] runQuery called admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] run command admin.$cmd { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" }
      Sat Dec 14 14:36:14.162 [conn2] command admin.$cmd command: { replSetHeartbeat: "test", v: 4, pv: 1, checkEmpty: false, from: "asyasmacbook.local:40002" } ntoreturn:1 keyUpdates:0  reslen:120 0ms
      Sat Dec 14 14:36:14.462 BackgroundJob starting: MultiCommandJob
      Sat Dec 14 14:36:14.462 [rsMgr] replSet dev we are freshest of up nodes, nok:1 nTies:1
      Sat Dec 14 14:36:14.462 [rsMgr] replSet info electSelf 0
      Sat Dec 14 14:36:14.462 BackgroundJob starting: MultiCommandJob
      Sat Dec 14 14:36:14.462 [rsMgr] replSet election succeeded, assuming primary role
      Sat Dec 14 14:36:14.462 [rsMgr] replSet assuming primary
      Sat Dec 14 14:36:14.462 [rsMgr] replSet waiting for replication to finish before becoming primary
      Sat Dec 14 14:36:14.465 [rsMgr] runQuery called local.oplog.rs { query: {}, orderby: { $natural: -1 } }
      Sat Dec 14 14:36:14.465 [rsMgr] query local.oplog.rs query: { query: {}, orderby: { $natural: -1 } } ntoreturn:1 ntoskip:0 nscanned:1 keyUpdates:0  nreturned:1 reslen:106 0ms
      Sat Dec 14 14:36:14.465 [rsMgr] replSet PRIMARY
      Sat Dec 14 14:36:15.155 [conn3] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1387058405000|1 } } cursorid:35061558897211 ntoreturn:0 keyUpdates:0 locks(micros) r:3260 nreturned:0 reslen:20 4002ms
      

            Assignee:
            matt.dannenberg Matt Dannenberg
            Reporter:
            asya.kamsky@mongodb.com Asya Kamsky
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: