[SERVER-2899] Replica set nodes don't reconnect after being down, while rs.status() on the last started node shows all servers as being up Created: 05/Apr/11  Updated: 04/Feb/15  Resolved: 28/Oct/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Johnny Boy Assignee: Kristina Chodorow (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

FreeBSD 8.2 jail


Attachments: File replicaset.tgz     File replicaset.tgz    
Operating System: FreeBSD
Participants:

 Description   

I'm testing a replica set of four MongoDB 1.8.1-rc1 instances, each running in its own jail on FreeBSD 8.2.

If I shut down (a clean kill) the primary (a.k.a. mongo1) and one secondary (a.k.a. mongo2), the other two secondaries (a.k.a. mongo3 & mongo4) stay running and notice that the other two went away, as they should.
After restarting mongo2, it gets voted in as the new primary. All seems well (we know mongo1 is still down) when you check rs.status() from mongo2:

DuegoWeb:PRIMARY> rs.status()
{
"set" : "DuegoWeb",
"date" : ISODate("2011-04-05T08:31:11Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "mongo1.lan",
"health" : 0,
"state" : 6,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:11Z"),
"errmsg" : "not running with --replSet"
},
{
"_id" : 1,
"name" : "mongo2.lan",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" :

{ "t" : 1301991284000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-05T08:14:44Z"),
"self" : true
},
{
"_id" : 2,
"name" : "mongo3.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1485,
"optime" :

{ "t" : 1301930804000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-04T15:26:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:11Z")
},
{
"_id" : 3,
"name" : "mongo4.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1485,
"optime" :

{ "t" : 1301930804000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-04T15:26:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:11Z")
}
],
"ok" : 1
}

However, if we move to mongo3 and run rs.status() there, it says mongo2 isn't available:
{
"_id" : 1,
"name" : "mongo2.lan",
"health" : 0,
"state" : 2,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 1301930804000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-04T15:26:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:30:33Z"),
"errmsg" : "not running with --replSet"
},

I find it confusing that rs.status() on mongo2 can say that mongo3 is ok, but not vice versa.
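
To compare what each node sees side by side, a quick sketch that only reformats fields already present in the rs.status() output above can be run on every member from the shell:

rs.status().members.forEach(function (m) {
    // one line per member: name, health flag and state string
    print(m.name + "  health=" + m.health + "  " + m.stateStr);
});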

If we then also start mongo1, rs.status() on that server says all servers are ok, while mongo2 still doesn't show mongo1 as being up:
DuegoWeb:SECONDARY> rs.status()
{
"set" : "DuegoWeb",
"date" : ISODate("2011-04-05T08:31:30Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "mongo1.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" :

{ "t" : 1301991284000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-05T08:14:44Z"),
"self" : true
},
{
"_id" : 1,
"name" : "mongo2.lan",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 64,
"optime" :

{ "t" : 1301991284000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-05T08:14:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
},
{
"_id" : 2,
"name" : "mongo3.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 64,
"optime" :

{ "t" : 1301930804000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-04T15:26:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
},
{
"_id" : 3,
"name" : "mongo4.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 64,
"optime" :

{ "t" : 1301930804000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-04T15:26:44Z"),
"lastHeartbeat" : ISODate("2011-04-05T08:31:28Z")
}
],
"ok" : 1
}

mongo2 and mongo3 still show the same rs.status() output as before mongo1 was started again.
Data inserted on mongo2 doesn't get replicated to mongo3, even though mongo2 reports mongo3 as ok, and even though mongo3 seems to have participated in electing mongo2 as the new primary.
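
A minimal sketch of the kind of replication check described above (the collection and field names are placeholders, not the actual test data):

// on mongo2 (the PRIMARY)
db.foo.insert({ x: 1, when: new Date() });

// on mongo3 (a SECONDARY)
db.getMongo().setSlaveOk();   // allow reads from a secondary
db.foo.find({ x: 1 });        // comes back empty, even though mongo2 reports mongo3 as healthy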

Sorry if my example is badly explained.

I'll attach logs and all rs.status() outputs; the order of events is:

  • start mongo1, mongo2, mongo3, mongo4
  • set up the replica set, check that it replicates and everything works fine; mongo1 is the primary
  • kill mongo1 and mongo2
  • mongo3 and mongo4 stay secondaries since there is no majority
  • start mongo2
  • mongo2 gets elected as the new primary
  • rs.status() output on mongo2 and mongo3 doesn't match; data inserted on mongo2 doesn't show up on mongo3
  • start mongo1
  • mongo1 says all servers are ok; mongo2 and mongo3 still show the same status as before

The logs on mongo1 are offset by +2 hours; I corrected the time on this machine later, with the same results.
I should also add that the replica set is specified in each node's config file like this:
Mongo1:
replSet=DuegoWeb
journal=true
Mongo2:
replSet=DuegoWeb/mongo1.lan,mongo2.lan
Mongo3:
replSet=DuegoWeb/mongo3.lan,mongo1.lan
Mongo4:
replSet=DuegoWeb/mongo4.lan,mongo1.lan

Everything works and gets back in sync as long as I restart the mongod servers manually, but they never reconnect automatically.
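
For comparison, the same four-member set could also be declared once from the shell after starting every mongod with just replSet=DuegoWeb (and no seed list) in its config file. This is only a sketch; the port below is the default and an assumption:

// run once, against a single node (e.g. mongo1), after all four mongod processes are up
rs.initiate({
    _id: "DuegoWeb",
    members: [
        { _id: 0, host: "mongo1.lan:27017" },
        { _id: 1, host: "mongo2.lan:27017" },
        { _id: 2, host: "mongo3.lan:27017" },
        { _id: 3, host: "mongo4.lan:27017" }
    ]
});
rs.conf();   // verify the configuration that was stored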



 Comments   
Comment by Kristina Chodorow (Inactive) [ 28/Oct/11 ]

This should be fixed by SERVER-3715

Comment by Kristina Chodorow (Inactive) [ 19/Oct/11 ]

If you're still having problems with this, can you try the latest Development Release (Unstable) version from http://www.mongodb.org/downloads? I recently added some code that should make heartbeat reconnection much more aggressive.
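
As a quick sanity check after upgrading, the build a node is actually running can be confirmed from the shell, for example:

db.version();          // prints the server version string of the node you are connected to
db.serverBuildInfo();  // full build details, including the git revision it was built from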

Comment by Johnny Boy [ 06/Apr/11 ]

OK, here is run #2.
I've left mongo4 out this time, so don't mind it showing as down in the logs.

This time I only started the three servers, but mongo3, which was elected primary, never showed up as ok in rs.status() from mongo1 and mongo2, while mongo3 did see mongo1 and mongo2 as ok.
Only after restarting mongo1 and mongo2 did everything show up as ok, and only then did they replicate from mongo3.

I'm new to replica sets, but this behavior seems really odd to me; in my mind the nodes should try to reconnect by themselves without me having to restart them manually.

Attached are the full logs from mongo1, mongo2 and mongo3.

This is what rs.status() shows on mongo1 (and the equivalent on mongo2) before restarting them:

DuegoWeb:SECONDARY> rs.status()
{
"set" : "DuegoWeb",
"date" : ISODate("2011-04-06T09:20:37Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "mongo1.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" :

{ "t" : 1302002256000, "i" : 2 }

,
"optimeDate" : ISODate("2011-04-05T11:17:36Z"),
"self" : true
},
{
"_id" : 1,
"name" : "mongo2.lan",
"health" : 0,
"state" : 6,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:37Z"),
"errmsg" : "not running with --replSet"
},
{
"_id" : 2,
"name" : "mongo3.lan",
"health" : 0,
"state" : 6,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:37Z"),
"errmsg" : "not running with --replSet"
},
{
"_id" : 3,
"name" : "mongo4.lan",
"health" : 0,
"state" : 6,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:37Z"),
"errmsg" : "not running with --replSet"
}
],
"ok" : 1
}

and this is after restarting:

DuegoWeb:SECONDARY> rs.status()
{
"set" : "DuegoWeb",
"date" : ISODate("2011-04-06T09:20:59Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "mongo1.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" :

{ "t" : 1302081483000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-06T09:18:03Z"),
"self" : true
},
{
"_id" : 1,
"name" : "mongo2.lan",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2,
"optime" :

{ "t" : 1302081483000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-06T09:18:03Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:57Z")
},
{
"_id" : 2,
"name" : "mongo3.lan",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 8,
"optime" :

{ "t" : 1302081483000, "i" : 1 }

,
"optimeDate" : ISODate("2011-04-06T09:18:03Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:57Z")
},
{
"_id" : 3,
"name" : "mongo4.lan",
"health" : 0,
"state" : 6,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" :

{ "t" : 0, "i" : 0 }

,
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2011-04-06T09:20:57Z"),
"errmsg" : "not running with --replSet"
}
],
"ok" : 1
}

Also, the IP addresses, to help with reading the logs:
mongo1.lan = 10.101.5.1
mongo2.lan = 10.101.5.2
mongo3.lan = 10.101.5.3

Comment by Kristina Chodorow (Inactive) [ 05/Apr/11 ]

Great, thanks.

Comment by Johnny Boy [ 05/Apr/11 ]

Sure thing, I'll run it again a little later and try to save as much as possible.
I could probably leave mongo4 out as well

Comment by Kristina Chodorow (Inactive) [ 05/Apr/11 ]

It is difficult for me to decipher what is happening with the mongo1 logs from a different time (and the mongo4 logs are missing). Could you run the experiment again (with -vvvvv, if possible) and zip up the logs again? Please send the whole logs, not just the relevant subsections. I'd rather have a billion extra log lines than miss something important at the edges!
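
If it is easier than passing the flag on every command line, the extra verbosity can also go into each node's config file; the spelling below is for the legacy config format and is an assumption on my part, while -vvvvv on the mongod command line is the form requested above:

# equivalent to starting mongod with -vvvvv (assumed legacy config-file spelling)
vvvvv = true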
