Core Server / SERVER-10621

Replication failover hangs indefinitely when priorities conflict with replication


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.4.6
    • Fix Version/s: 3.2.0
    • Component/s: Replication
    • Environment: Ubuntu 12.04 LTS
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

    Description

      This is a continuation of SERVER-9464, which was (apparently) not really fixed; see the original case.

      This appears to be a deadlock between priority and freshness.
      When running the original case's repro script, after db0 is killed (at the "Waiting for db1 to become primary" stage), the db1 log is filled with:

      [rsMgr] not electing self, we are not freshest

      and the db2 log is filled with:

      [rsMgr] not electing self, 127.0.0.1:30001 would veto with '127.0.0.1:30002 has lower priority than 127.0.0.1:30001'

      I suspect the load-generator timing is important for reproducibility: if db2 replicates from db0 faster than db1 does, the set always ends up in this deadlock.
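      For context, the topology implied by the veto message is a three-member set where db1 (30001) outranks db2 (30002). A minimal sketch of such a configuration follows (the real settings live in the attached test.sh; the priority values below are illustrative assumptions):

      rs.initiate({
          _id: "test",
          members: [
              { _id: 0, host: "127.0.0.1:30000", priority: 3 },  // db0: initial primary, killed by the repro
              { _id: 1, host: "127.0.0.1:30001", priority: 2 },  // db1: higher priority, but not freshest
              { _id: 2, host: "127.0.0.1:30002", priority: 1 }   // db2: freshest, but vetoed by db1
          ]
      })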

      test:SECONDARY> rs.status()
      {
          "set" : "test",
          "date" : ISODate("2013-08-26T12:43:18Z"),
          "myState" : 2,
          "members" : [
              {
                  "_id" : 0,
                  "name" : "127.0.0.1:30000",
                  "health" : 0,
                  "state" : 8,
                  "stateStr" : "(not reachable/healthy)",
                  "uptime" : 0,
                  "optime" : Timestamp(1377520302, 6340),
                  "optimeDate" : ISODate("2013-08-26T12:31:42Z"),
                  "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
                  "lastHeartbeatRecv" : ISODate("2013-08-26T12:31:42Z"),
                  "pingMs" : 0
              },
              {
                  "_id" : 1,
                  "name" : "127.0.0.1:30001",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 730,
                  "optime" : Timestamp(1377520303, 6044),
                  "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
                  "self" : true
              },
              {
                  "_id" : 2,
                  "name" : "127.0.0.1:30002",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 718,
                  "optime" : Timestamp(1377520303, 6045),
                  "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
                  "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
                  "lastHeartbeatRecv" : ISODate("2013-08-26T12:43:17Z"),
                  "pingMs" : 0,
                  "syncingTo" : "127.0.0.1:30000"
              }
          ],
          "ok" : 1
      }
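
      Note the optimes in the output above: db2 (30002) is at Timestamp(1377520303, 6045), one operation ahead of db1 (30001) at Timestamp(1377520303, 6044), so db1 fails the freshness check while db2 fails the priority check, and neither can win the election. A quick way to watch the stalemate from the shell (a sketch; run against any surviving member):

      // Print each member's state and optime to see who is freshest.
      var s = rs.status();
      s.members.forEach(function (m) {
          print(m.name + " " + m.stateStr + " optime: " + m.optime);
      });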

      Also, while this might be unrelated, there are the following entries in the db1 log:

      [rsSyncNotifier] replset setting oplog notifier to 127.0.0.1:30000
      [rsSyncNotifier] Socket flush send() errno:9 Bad file descriptor 127.0.0.1:30000
      [rsSyncNotifier] caught exception (socket exception [SEND_ERROR] for 127.0.0.1:30000) in destructor (~PiggyBackData)

      I'm not sure what your policy on exceptions in destructors is, but this might be an indicator of an improperly cleaned-up connection.
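
      For what it's worth, a force reconfig from the freshest member should break the stalemate (an assumed workaround sketch, not something this report verified):

      // Assumed workaround: run on db2 (the freshest surviving member) to
      // lift the priority veto so an election can complete.
      var cfg = rs.conf();
      cfg.members[2].priority = 10;   // members[2] is 127.0.0.1:30002 in the status output above
      rs.reconfig(cfg, { force: true });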

    Attachments

        1. loadgen.rb (0.4 kB, Aristarkh Zagorodnikov)
        2. test.sh (3 kB, Aristarkh Zagorodnikov)


    People

        Assignee: Backlog - Replication Team (backlog-server-repl)
        Reporter: Aristarkh Zagorodnikov (onyxmaster)
        Votes: 0
        Watchers: 13
