Core Server / SERVER-9326

Node became SECONDARY in the middle of compact command.

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.2.3
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      The compact command was issued on the primary. The node went into the RECOVERING state as expected, but shortly after that, while still running compact, it somehow became SECONDARY. We run the compact command weekly via cron and have never seen this behavior before.
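
      For reference, the sequence the weekly cron job effectively performs is roughly the following mongo shell sketch, reconstructed from the log lines below rather than copied from the actual script (the 600-second step-down matches the "stepping down as primary secs=600" entry):

      // Run against crwlrdb04, the current primary.
      rs.stepDown(600)                         // relinquish primary for 10 minutes

      // compact puts the node into maintenance mode (RECOVERING) for its duration.
      var crawlerDb = db.getSiblingDB("crawlerDb")
      crawlerDb.runCommand({ compact: "itemContents" })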

      Replica set config:

      {
        "_id": "RS_Crawler_2",
        "version": 190441,
        "members": [
          {
            "_id": 0,
            "host": "crwlrdb04:8086",
            "priority": 10
          },
          {
            "_id": 1,
            "host": "crwlrdb05:8086"
          },
          {
            "_id": 2,
            "host": "crwlrdb06:8086",
            "arbiterOnly": true
          },
          {
            "_id": 3,
            "host": "backup.ec2:18089",
            "priority": 0,
            "slaveDelay": 21600,
            "hidden": true,
            "buildIndexes": false
          }
        ]
      }
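
      Note that crwlrdb04 carries priority 10, so whatever state it reports drives election decisions on the other members. The standard shell helpers below (not part of the original report) show how member states and priorities can be checked while compact is in flight:

      // Member states as seen by the replica set; during compact crwlrdb04
      // should report RECOVERING, not SECONDARY.
      rs.status().members.forEach(function (m) {
          print(m.name + " -> " + m.stateStr)
      })

      // Effective priorities from the config (priority defaults to 1 if unset).
      rs.conf().members.forEach(function (m) {
          print(m.host + " priority=" + (m.priority === undefined ? 1 : m.priority))
      })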
      

      Log of crwlrdb04 (node performing compact):

      Sun Apr  7 04:00:11 [conn548282] replSet info stepping down as primary secs=600
      Sun Apr  7 04:00:11 [conn548282] replSet relinquishing primary state
      Sun Apr  7 04:00:11 [conn548282] replSet SECONDARY
      Sun Apr  7 04:00:11 [conn548282] replSet closing client sockets after relinquishing primary
      Sun Apr  7 04:00:11 [conn548283] replSet going into maintenance mode (-3 other tasks)
      Sun Apr  7 04:00:11 [conn548283] replSet RECOVERING
      Sun Apr  7 04:00:11 [conn548282] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.100.0.28:59852] 
      Sun Apr  7 04:00:11 [conn410325] SocketException handling request, closing client connection: 9001 socket exception [2] server [xxx.xxx.xxx.221:56357] 
      Sun Apr  7 04:00:11 [conn548283] compact crawlerDb.itemContents begin
      Sun Apr  7 04:00:11 [conn548283] paddingFactor:1 paddingBytes:0
      Sun Apr  7 04:00:12 [conn548283] compact 37 extents
      Sun Apr  7 04:00:12 [conn548283] compact orphan deleted lists
      Sun Apr  7 04:00:12 [conn548283] compact dropping indexes
      Sun Apr  7 04:00:12 [conn548283] compact begin extent #0 for namespace crawlerDb.itemContents
      Sun Apr  7 04:00:12 [conn548283] compact paging in len=0.004096MB
      Sun Apr  7 04:00:12 [conn548283] compact copying records
      Sun Apr  7 04:00:12 [conn548283] compact finished extent #0 containing 2 documents (0.003576MB) oldPadding: 1.08226 1
      Sun Apr  7 04:00:12 [conn548283] compact begin extent #1 for namespace crawlerDb.itemContents
      Sun Apr  7 04:00:12 [conn548283] compact paging in len=0.831488MB
      Sun Apr  7 04:00:12 [conn548283] compact copying records
      Sun Apr  7 04:00:12 [rsHealthPoll] replSet info backup.ec2:18089 is down (or slow to respond): socket exception [SEND_ERROR] for xxx.xxx.xxx.239:18089
      Sun Apr  7 04:00:12 [rsHealthPoll] replSet member backup.ec2:18089 is now in state DOWN
      Sun Apr  7 04:00:13 [rsHealthPoll] replSet info crwlrdb06:8086 is down (or slow to respond): socket exception [SEND_ERROR] for 10.101.0.72:8086
      Sun Apr  7 04:00:13 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state DOWN
      Sun Apr  7 04:00:14 [conn430592] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.101.0.71:33255] 
      Sun Apr  7 04:00:14 [conn548283] compact finished extent #1 containing 43 documents (0.828596MB) oldPadding: 1.00007 1
      Sun Apr  7 04:00:14 [conn548283] compact begin extent #2 for namespace crawlerDb.itemContents
      Sun Apr  7 04:00:14 [conn548283] compact paging in len=3.32595MB
      Sun Apr  7 04:00:14 [rsSync] replSet SECONDARY
      Sun Apr  7 04:00:14 [conn548283] compact copying records
      Sun Apr  7 04:00:14 [conn548283] compact finished extent #2 containing 151 documents (3.32295MB) oldPadding: 1.0001 1
      Sun Apr  7 04:00:14 [conn548283] compact begin extent #3 for namespace crawlerDb.itemContents
      Sun Apr  7 04:00:14 [conn548283] compact paging in len=12.7427MB
      

      Log of crwlrdb05:

      Sun Apr  7 04:00:11 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state RECOVERING
      Sun Apr  7 04:00:12 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:00:12 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: crwlrdb04:8086
      Sun Apr  7 04:00:15 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state SECONDARY
      Sun Apr  7 04:00:16 [rsMgr] replSet info electSelf 1
      Sun Apr  7 04:00:16 [rsMgr] replSet PRIMARY
      Sun Apr  7 04:00:17 [conn913928] query local.oplog.rs query: { ts: { $gte: new Date(5864049901565902853) } } cursorid:3635176764573440955 ntoreturn:0 ntoskip:0 
      nscanned:2 keyUpdates:0 locks(micros) r:104450 nreturned:1 reslen:217 104ms
      Sun Apr  7 04:00:25 [conn913936] no current chunk manager found for this shard, will initialize
      Sun Apr  7 04:00:25 [conn913936] setShardVersion - relocking slow: 96
      ...
      Sun Apr  7 04:10:17 [conn914483] replSet info stepping down as primary secs=1
      Sun Apr  7 04:10:17 [conn914483] replSet relinquishing primary state
      Sun Apr  7 04:10:17 [conn914483] replSet SECONDARY
      Sun Apr  7 04:10:17 [conn914483] replSet closing client sockets after relinquishing primary
      Sun Apr  7 04:10:18 [rsHealthPoll] replSet info crwlrdb06:8086 is down (or slow to respond): socket exception [SEND_ERROR] for 10.101.0.72:8086
      Sun Apr  7 04:10:18 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state DOWN
      Sun Apr  7 04:10:18 [rsMgr] replSet info electSelf 1
      Sun Apr  7 04:10:18 [rsMgr] replSet couldn't elect self, only received -19999 votes
      Sun Apr  7 04:10:19 [conn913928] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.101.0.70:38568] 
      Sun Apr  7 04:10:20 [rsHealthPoll] replSet member crwlrdb06:8086 is up
      Sun Apr  7 04:10:20 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state ARBITER
      Sun Apr  7 04:10:20 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.25:8086
      Sun Apr  7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.24:8086
      Sun Apr  7 04:10:20 [ReplicaSetMonitorWatcher] trying reconnect to crwlrdb02:8086
      Sun Apr  7 04:10:20 [ReplicaSetMonitorWatcher] reconnect crwlrdb02:8086 ok
      Sun Apr  7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.25:8086
      Sun Apr  7 04:10:22 [conn914484] replSet info voting yea for crwlrdb04:8086 (0)
      Sun Apr  7 04:10:26 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:30 [ReplicaSetMonitorWatcher] trying reconnect to crwlrdb01:8086
      Sun Apr  7 04:10:30 [ReplicaSetMonitorWatcher] reconnect crwlrdb01:8086 ok
      Sun Apr  7 04:10:30 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.24:8086
      Sun Apr  7 04:10:32 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:37 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.43:8086
      Sun Apr  7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.44:8086
      Sun Apr  7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.45:8086
      Sun Apr  7 04:10:38 [LockPinger] scoped connection to crwlrcfgdb01:8086,crwlrcfgdb02:8086,crwlrcfgdb03:8086 not being returned to the pool
      Sun Apr  7 04:10:38 [LockPinger] warning: distributed lock pinger 'crwlrcfgdb01:8086,crwlrcfgdb02:8086,crwlrcfgdb03:8086/crwlrdb05.prod.dal.us:8086:1361707225:275889730' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 9001 socket exception [2] server [10.101.0.43:8086]  crwlrcfgdb01:8086:{}9001 socket exception [2] server [10.101.0.44:8086]  crwlrcfgdb02:8086:{}9001 socket exception [2] server [10.101.0.45:8086]  crwlrcfgdb03:8086:{}
      Sun Apr  7 04:10:42 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:47 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:52 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:10:57 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:11:02 [rsMgr] not electing self, backup.ec2:18089 would veto
      Sun Apr  7 04:11:07 [rsMgr] not electing self, backup.ec2:18089 would veto
      ...
      

      Log of crwlrdb06 (arbiter node):

      Sun Apr  7 04:00:12 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state RECOVERING
      Sun Apr  7 04:00:12 [rsMgr] replSet I don't see a primary and I can't elect myself
      Sun Apr  7 04:00:14 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state SECONDARY
      Sun Apr  7 04:00:16 [conn890924] replSet info voting yea for crwlrdb05:8086 (1)
      Sun Apr  7 04:00:18 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state PRIMARY
      Sun Apr  7 04:10:17 [rsMgr] stepping down crwlrdb05:8086 (priority 1), crwlrdb04:8086 is priority 10 and 0 seconds behind
      Sun Apr  7 04:10:18 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state SECONDARY
      Sun Apr  7 04:10:18 [rsMgr] stepping down crwlrdb05:8086 (priority 1), crwlrdb04:8086 is priority 10 and 0 seconds behind
      Sun Apr  7 04:10:18 [rsMgr] stepping down crwlrdb05:8086 failed: { errmsg: "not primary so can't step down", ok: 0.0 }
      Sun Apr  7 04:10:22 [conn891051] replSet info voting yea for crwlrdb04:8086 (0)
      Sun Apr  7 04:11:23 [rsMgr] replSet I don't see a primary and I can't elect myself
      Sun Apr  7 04:12:38 [rsMgr] replSet I don't see a primary and I can't elect myself
      Sun Apr  7 04:13:53 [rsMgr] replSet I don't see a primary and I can't elect myself
      Sun Apr  7 04:15:08 [rsMgr] replSet I don't see a primary and I can't elect myself
      Sun Apr  7 04:16:23 [rsMgr] replSet I don't see a primary and I can't elect myself
      ...
      Sun Apr  7 04:40:10 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state PRIMARY
      Mon Apr  8 04:49:15 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state RECOVERING
      

      Why it matters:

      crwlrdb04, the node under compact, has the highest priority in the replica set. When it wrongly reported its state as SECONDARY (while still running the compact command), the other PRIMARY was stepped down and all nodes voted for crwlrdb04 to become primary. But since it was busy with compact, it did not actually become primary for another 40 minutes. During those 40 minutes the replica set had no primary at all and was non-operational.
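
      A possible mitigation, sketched here under the assumption that the compaction reliably finishes within an hour (not taken from the report itself), is to keep the priority-10 node out of elections while it is busy:

      // Run on crwlrdb04 before issuing compact.

      // Option 1: after stepping down, freeze the node for longer than the
      // compaction is expected to take, so it cannot collect votes while busy.
      rs.stepDown(600)
      rs.freeze(3600)     // 3600s is an assumed upper bound on compact duration

      // Option 2: while it is still primary, temporarily lower its priority via
      // reconfig and restore the original value after compact finishes.
      var cfg = rs.conf()
      cfg.members[0].priority = 1     // members[0] is crwlrdb04 in the config above
      rs.reconfig(cfg)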

            Assignee:
            milkie@mongodb.com Eric Milkie
            Reporter:
            silver aleksey
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: