[SERVER-2899] Replica set nodes don't reconnect after being down, while rs.status() on the last-started node shows all servers as being up Created: 05/Apr/11 Updated: 04/Feb/15 Resolved: 28/Oct/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Johnny Boy | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | FreeBSD 8.2 jail |
| Attachments: | |
| Operating System: | FreeBSD |
| Participants: |
| Description |
|
I'm testing a replica set of four mongodb 1.8.1-rc1 instances, each running in its own jail on FreeBSD 8.2. If I shut down (a clean kill) the primary (a.k.a. mongo1) and one secondary (a.k.a. mongo2), the other two secondaries (a.k.a. mongo3 and mongo4) stay running and notice that the other two went away, as they should.

DuegoWeb:PRIMARY> rs.status()
[rs.status() output not preserved in this export]

However, if we move to mongo3 and also run rs.status() there, it says mongo2 isn't available:

[rs.status() output not preserved in this export]

I find it confusing that rs.status() on mongo2 can say that mongo3 is ok, but not vice versa. If we then also start mongo1 back up, rs.status() on that server says all servers are ok, while mongo2 still doesn't show mongo1 as being up:

[rs.status() output not preserved in this export]

The same rs.status() is still shown on mongo2 and mongo3, just as before mongo1 was started again. Sorry if my example is badly explained. I'll attach the logs and all rs statuses; the order is: [attachment list not preserved in this export]

Note that the logs on mongo1 are +2 hours; I corrected the time on this machine later, with the same results. Everything works and gets in sync as long as I restart the mongodb servers manually, but they never reconnect automatically. |
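For reference, this is roughly how each member's state can be dumped from the mongo shell (a minimal sketch; name, stateStr and health are standard rs.status() fields):

    // Print this node's view of every replica set member
    rs.status().members.forEach(function (m) {
        print(m.name + " -> " + m.stateStr + " (health: " + m.health + ")");
    });

Running this on each node makes the asymmetric views described above easy to capture side by side. |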
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 28/Oct/11 ] |
|
This should be fixed by [linked ticket not preserved in this export]. |
| Comment by Kristina Chodorow (Inactive) [ 19/Oct/11 ] |
|
If you're still having problems with this, can you try the latest Development Release (Unstable) version from http://www.mongodb.org/downloads? I recently added some code that should make heartbeat reconnection much more aggressive. |
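In the meantime, a minimal sketch for watching whether heartbeats resume (mongo shell; rs.status() reports lastHeartbeat for remote members only, so the connected node itself is skipped):

    // Show when this node last heard from each remote member
    rs.status().members.forEach(function (m) {
        if (m.lastHeartbeat)
            print(m.name + " last heartbeat: " + m.lastHeartbeat);
    });

If the timestamp stops advancing for a member that is actually up, the reconnection logic is the likely culprit. |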
| Comment by Johnny Boy [ 06/Apr/11 ] |
|
Ok, here is run #2. This time I only started three of the servers, but mongo3, which was elected primary, never showed up as being ok in rs.status() on mongo1 and mongo2, while mongo3 did see 1 and 2 as being ok. I'm new to replica sets, but this behavior seems really odd to me; in my mind they should try to reconnect by themselves, without me having to restart them manually. Attached are the full logs from mongo1, 2 and 3.

This is what rs.status() shows on mongo1 (and the equivalent on mongo2) before restarting them:

DuegoWeb:SECONDARY> rs.status()
[rs.status() output not preserved in this export]

and this is after restarting:

DuegoWeb:SECONDARY> rs.status()
[rs.status() output not preserved in this export]

Also, the IP addresses, to help with the logs: [address list not preserved in this export] |
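To compare the views side by side, here is a sketch that queries replSetGetStatus (the command behind rs.status()) on each node from a single shell; the hostnames below are placeholders for the actual jail addresses:

    // Ask every node for its own view of the set
    ["mongo1:27017", "mongo2:27017", "mongo3:27017"].forEach(function (host) {
        try {
            var conn = new Mongo(host);
            var st = conn.getDB("admin").runCommand({replSetGetStatus: 1});
            print(host + " sees: " + st.members.map(function (m) {
                return m.name + "=" + m.stateStr;
            }).join(", "));
        } catch (e) {
            print(host + " unreachable: " + e);
        }
    });

Each output line shows one node's picture of the set, which makes a mismatch like the one above obvious at a glance. |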
| Comment by Kristina Chodorow (Inactive) [ 05/Apr/11 ] |
|
Great, thanks. |
| Comment by Johnny Boy [ 05/Apr/11 ] |
|
Sure thing, I'll run it again a little later and try to save as much as possible. |
| Comment by Kristina Chodorow (Inactive) [ 05/Apr/11 ] |
|
It is difficult for me to decipher what is happening with the mongo1 logs from a different time (and the mongo4 logs are missing). Could you run the experiment again (with -vvvvv, if possible) and zip up the logs again? Please send the whole logs, not just the relevant subsections. I'd rather have a billion extra log lines than miss something important at the edges! |
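For reference, a startup invocation with maximum verbosity might look like this (the dbpath and logpath are placeholders; adjust them per jail, and the set name DuegoWeb is taken from the shell prompts above):

    mongod --replSet DuegoWeb --dbpath /data/db --logpath /var/log/mongod.log -vvvvv

Each extra v raises the log verbosity one level, so -vvvvv gives the most detailed heartbeat and reconnection logging. |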