Core Server / SERVER-2899

Replica set nodes don't reconnect after being down, while rs.status() on the last started node shows all servers as up

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 1.8.1
    • Component/s: Replication
    • Labels: None
    • Environment: FreeBSD 8.2 jail

      I'm testing a replica set of four mongodb 1.8.1-rc1 instances, each running in its own jail on FreeBSD 8.2.

      If I shut down (a clean kill) the primary (a.k.a. mongo1) and one secondary (a.k.a. mongo2), the other two secondaries (a.k.a. mongo3 & mongo4) stay running and notice that the other two went away, as they should.
      After restarting the mongo2 server, it gets voted primary. All seems to be well (we know mongo1 is still down) when you check rs.status() from mongo2:

      DuegoWeb:PRIMARY> rs.status()
      {
          "set" : "DuegoWeb",
          "date" : ISODate("2011-04-05T08:31:11Z"),
          "myState" : 1,
          "members" : [
              {
                  "_id" : 0,
                  "name" : "mongo1.lan",
                  "health" : 0,
                  "state" : 6,
                  "stateStr" : "(not reachable/healthy)",
                  "uptime" : 0,
                  "optime" : { "t" : 0, "i" : 0 },
                  "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:11Z"),
                  "errmsg" : "not running with --replSet"
              },
              {
                  "_id" : 1,
                  "name" : "mongo2.lan",
                  "health" : 1,
                  "state" : 1,
                  "stateStr" : "PRIMARY",
                  "optime" : { "t" : 1301991284000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-05T08:14:44Z"),
                  "self" : true
              },
              {
                  "_id" : 2,
                  "name" : "mongo3.lan",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 1485,
                  "optime" : { "t" : 1301930804000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-04T15:26:44Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:11Z")
              },
              {
                  "_id" : 3,
                  "name" : "mongo4.lan",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 1485,
                  "optime" : { "t" : 1301930804000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-04T15:26:44Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:11Z")
              }
          ],
          "ok" : 1
      }
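      For reference when reading the numeric "state" fields in the output above, the three states that appear in this report correspond to the standard MongoDB replica-set state codes; sketched here as a small lookup table:

```javascript
// Lookup table for the replica-set member "state" codes that appear in
// the rs.status() output above (standard MongoDB state numbering).
const MEMBER_STATES = {
  1: "PRIMARY",    // e.g. mongo2.lan in mongo2's own view
  2: "SECONDARY",  // e.g. mongo3.lan and mongo4.lan
  6: "UNKNOWN",    // rendered as "(not reachable/healthy)", e.g. mongo1.lan
};

console.log(MEMBER_STATES[6]); // → UNKNOWN
```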

      However, if we move to mongo3 and run rs.status() there, it says mongo2 isn't available:
      {
          "_id" : 1,
          "name" : "mongo2.lan",
          "health" : 0,
          "state" : 2,
          "stateStr" : "(not reachable/healthy)",
          "uptime" : 0,
          "optime" : { "t" : 1301930804000, "i" : 1 },
          "optimeDate" : ISODate("2011-04-04T15:26:44Z"),
          "lastHeartbeat" : ISODate("2011-04-05T08:30:33Z"),
          "errmsg" : "not running with --replSet"
      },

      I find it confusing that rs.status() on mongo2 can say mongo3 is OK, but not vice versa.

      If we then also start mongo1, rs.status() on that server says all servers are OK, while mongo2 still doesn't show mongo1 as being up:
      DuegoWeb:SECONDARY> rs.status()
      {
          "set" : "DuegoWeb",
          "date" : ISODate("2011-04-05T08:31:30Z"),
          "myState" : 2,
          "members" : [
              {
                  "_id" : 0,
                  "name" : "mongo1.lan",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "optime" : { "t" : 1301991284000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-05T08:14:44Z"),
                  "self" : true
              },
              {
                  "_id" : 1,
                  "name" : "mongo2.lan",
                  "health" : 1,
                  "state" : 1,
                  "stateStr" : "PRIMARY",
                  "uptime" : 64,
                  "optime" : { "t" : 1301991284000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-05T08:14:44Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
              },
              {
                  "_id" : 2,
                  "name" : "mongo3.lan",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 64,
                  "optime" : { "t" : 1301930804000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-04T15:26:44Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
              },
              {
                  "_id" : 3,
                  "name" : "mongo4.lan",
                  "health" : 1,
                  "state" : 2,
                  "stateStr" : "SECONDARY",
                  "uptime" : 64,
                  "optime" : { "t" : 1301930804000, "i" : 1 },
                  "optimeDate" : ISODate("2011-04-04T15:26:44Z"),
                  "lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
              }
          ],
          "ok" : 1
      }

      mongo2 and mongo3 still show the same rs.status() output as before mongo1 was started again.
      Data inserted on mongo2 doesn't get replicated to mongo3, even though mongo2 reports mongo3 as OK, and even though mongo3 seems to have participated in voting mongo2 in as the new primary.

      Sorry if my example is badly explained.

      I'll attach logs and all rs statuses; the order is:

      • start mongo1, mongo2, mongo3, mongo4
      • set up the replica set, check that it replicates and everything works fine; mongo1 is the primary
      • kill mongo1 and mongo2
      • mongo3 and mongo4 stay secondaries, as there is no majority
      • start mongo2
      • mongo2 gets elected as the new primary
      • the rs.status() outputs on mongo2 and mongo3 don't match; data inserted on mongo2 doesn't show up on mongo3
      • start mongo1
      • mongo1 says all servers are OK; mongo2 and mongo3 still show the same status as before
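      The disagreement between the two nodes' views can be made concrete with a small diff over the member documents. A minimal Node.js sketch (the diffMemberViews helper is hypothetical; the member documents are abridged to name and health, and since mongo3's full member list is not shown in the report, its values beyond the mongo2.lan entry are illustrative):

```javascript
// Compare how two nodes' rs.status() outputs report member health.
// Field names ("name", "health") match the rs.status() output above.
function diffMemberViews(viewA, viewB) {
  // Index each view's members by name, then collect the names whose
  // "health" value differs between the two views.
  const byName = (view) => Object.fromEntries(view.map((m) => [m.name, m]));
  const a = byName(viewA);
  const b = byName(viewB);
  const disagreements = [];
  for (const name of Object.keys(a)) {
    if (b[name] && a[name].health !== b[name].health) {
      disagreements.push(name);
    }
  }
  return disagreements;
}

// mongo2's view, abridged from the report: it considers mongo3 healthy...
const mongo2View = [
  { name: "mongo1.lan", health: 0 },
  { name: "mongo2.lan", health: 1 },
  { name: "mongo3.lan", health: 1 },
  { name: "mongo4.lan", health: 1 },
];
// ...while mongo3's view marks mongo2 as unhealthy (only the mongo2.lan
// entry is taken from the report; the rest are illustrative).
const mongo3View = [
  { name: "mongo1.lan", health: 0 },
  { name: "mongo2.lan", health: 0 },
  { name: "mongo3.lan", health: 1 },
  { name: "mongo4.lan", health: 1 },
];

console.log(diffMemberViews(mongo2View, mongo3View)); // → [ 'mongo2.lan' ]
```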

      The timestamps in mongo1's logs are +2 hours off; I corrected the time on this machine later, with the same results.
      I should also add that the replica set is specified in a config file on each node, like this:

      mongo1:
      replSet=DuegoWeb
      journal=true

      mongo2:
      replSet=DuegoWeb/mongo1.lan,mongo2.lan

      mongo3:
      replSet=DuegoWeb/mongo3.lan,mongo1.lan

      mongo4:
      replSet=DuegoWeb/mongo4.lan,mongo1.lan
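      Note that mongo1's replSet line above names only the set, with no seed hosts after the slash, and each node lists a different seed pair. For comparison only (this is a sketch using the hostnames above, not claimed to be the cause of the problem), a uniform seed list on every node would look like:

```
replSet=DuegoWeb/mongo1.lan,mongo2.lan,mongo3.lan,mongo4.lan
```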

      Everything works and gets back in sync as long as I restart the mongodb servers manually, but they never reconnect automatically.

        1. replicaset.tgz (24 kB)
        2. replicaset.tgz (3 kB)

            Assignee: Kristina Chodorow (kristina) (Inactive)
            Reporter: Johnny Boy (balboah)