- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.4.6
- Component/s: Replication
- Environment: Ubuntu 12.04 LTS
- Replication
- Fully Compatible
- ALL
- (copied to CRM)
This is a continuation of SERVER-9464, which was (apparently) not really fixed; see the original case.
This appears to be a deadlock between priority and freshness.
When running the original case's repro script, after db0 is killed (while the script waits for db1 to become primary):
db1 log is filled with:
[rsMgr] not electing self, we are not freshest
db2 log is filled with:
[rsMgr] not electing self, 127.0.0.1:30001 would veto with '127.0.0.1:30002 has lower priority than 127.0.0.1:30001'
I guess the load generator timing is important for reproducibility. If db2 replicates from db0 faster than db1 does, it always gets into this deadlock; a sketch of the configuration involved is below.
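For context, here is a minimal sketch of the kind of setup that gets into this state. The ports match the output below; the priority values and the way the set is initiated are my assumptions, not the actual script from SERVER-9464:
// Run in a mongo shell connected to 127.0.0.1:30000, assuming three mongods
// are already listening on ports 30000-30002 (priorities here are hypothetical).
cfg = {
    _id: "test",
    members: [
        { _id: 0, host: "127.0.0.1:30000", priority: 3 },  // db0, initial primary
        { _id: 1, host: "127.0.0.1:30001", priority: 2 },  // db1, expected to take over
        { _id: 2, host: "127.0.0.1:30002", priority: 1 }   // db2, lowest priority
    ]
};
rs.initiate(cfg);
// ... run the load generator against db0, then kill -9 the db0 mongod ...
// If db2 happens to replicate the last writes before db1 does, db1 refuses to
// stand for election ("we are not freshest") while db2's attempts are vetoed
// because of db1's higher priority, so the set is left without a primary.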
test:SECONDARY> rs.status()
{
    "set" : "test",
    "date" : ISODate("2013-08-26T12:43:18Z"),
    "myState" : 2,
    "members" : [
        {
            "_id" : 0,
            "name" : "127.0.0.1:30000",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(1377520302, 6340),
            "optimeDate" : ISODate("2013-08-26T12:31:42Z"),
            "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-26T12:31:42Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "127.0.0.1:30001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 730,
            "optime" : Timestamp(1377520303, 6044),
            "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "127.0.0.1:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 718,
            "optime" : Timestamp(1377520303, 6045),
            "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
            "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-26T12:43:17Z"),
            "pingMs" : 0,
            "syncingTo" : "127.0.0.1:30000"
        }
    ],
    "ok" : 1
}
Also, while this might be unrelated, there are the following entries in the log of db1:
[rsSyncNotifier] replset setting oplog notifier to 127.0.0.1:30000
[rsSyncNotifier] Socket flush send() errno:9 Bad file descriptor 127.0.0.1:30000
[rsSyncNotifier] caught exception (socket exception [SEND_ERROR] for 127.0.0.1:30000) in destructor (~PiggyBackData)
I'm not sure what your policy on exceptions in destructors is, of course, but this might be an indicator of an improperly cleaned-up connection.
- related to: SERVER-9464 On election, relax priority restriction when another member is fresher (Closed)