[SERVER-10621] Replication failover hangs indefinitely when priorities conflict with replication Created: 26/Aug/13  Updated: 06/Dec/22  Resolved: 21/Dec/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: 3.2.0

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 12.04 LTS


Attachments: loadgen.rb, test.sh
Issue Links:
related to SERVER-9464 On election, relax priority restricti... Closed
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL

 Description   

This is a continuation of SERVER-9464, which (apparently) was not really fixed; see the original case.

This appears to be a deadlock between priority and freshness.
When running the original case's repro script, after db0 is killed (at the "Waiting for db1 to become primary" step), the db1 log is filled with:

[rsMgr] not electing self, we are not freshest

while the db2 log is filled with:

[rsMgr] not electing self, 127.0.0.1:30001 would veto with '127.0.0.1:30002 has lower priority than 127.0.0.1:30001'

I suspect the load generator timing matters for reproducibility: if db2 replicates from db0 faster than db1 does, the set always ends up in this deadlock.

test:SECONDARY> rs.status()
{
 "set" : "test",
 "date" : ISODate("2013-08-26T12:43:18Z"),
 "myState" : 2,
 "members" : [
 { "_id" : 0, "name" : "127.0.0.1:30000", "health" : 0, "state" : 8, "stateStr" : "(not reachable/healthy)", "uptime" : 0, "optime" : Timestamp(1377520302, 6340), "optimeDate" : ISODate("2013-08-26T12:31:42Z"), "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"), "lastHeartbeatRecv" : ISODate("2013-08-26T12:31:42Z"), "pingMs" : 0 },
 { "_id" : 1, "name" : "127.0.0.1:30001", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 730, "optime" : Timestamp(1377520303, 6044), "optimeDate" : ISODate("2013-08-26T12:31:43Z"), "self" : true },
{ "_id" : 2, "name" : "127.0.0.1:30002", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 718, "optime" : Timestamp(1377520303, 6045), "optimeDate" : ISODate("2013-08-26T12:31:43Z"), "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"), "lastHeartbeatRecv" : ISODate("2013-08-26T12:43:17Z"), "pingMs" : 0, "syncingTo" : "127.0.0.1:30000" } 
],
 "ok" : 1
}
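
For reference, the set in the repro looks roughly like this (a sketch only; the exact priority values come from the attached test.sh, and what matters is that db1 at 30001 outranks db2 at 30002):

// Illustrative only: db0's priority is an assumption, the others mirror the veto message above.
rs.initiate({
    _id: "test",
    members: [
        { _id: 0, host: "127.0.0.1:30000", priority: 3 },  // db0: the primary that gets killed
        { _id: 1, host: "127.0.0.1:30001", priority: 2 },  // db1: higher priority, one op behind after the kill
        { _id: 2, host: "127.0.0.1:30002", priority: 1 }   // db2: lower priority, but freshest
    ]
})

With that shape, db1 refuses to stand because it is not the freshest, while db2 is vetoed by db1 on priority, so neither ever becomes primary.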

Also, while this might be unrelated, there are the following entries in the db1 log:

[rsSyncNotifier] replset setting oplog notifier to 127.0.0.1:30000
[rsSyncNotifier] Socket flush send() errno:9 Bad file descriptor 127.0.0.1:30000
[rsSyncNotifier] caught exception (socket exception [SEND_ERROR] for 127.0.0.1:30000) in destructor (~PiggyBackData)

I'm not sure what your policy on exceptions in destructors is, of course, but this might be an indicator of an improperly cleaned-up connection.



 Comments   
Comment by Eric Milkie [ 21/Dec/15 ]

A new election scheme for priority was introduced in 3.2, which does not have this problem.
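
(Presumably this refers to replication protocol version 1, introduced in 3.2 and exposed through the protocolVersion field of the replica set configuration. A minimal sketch, assuming a set already running 3.2 binaries:)

// Check which election protocol the set is using; 1 means the new scheme.
rs.conf().protocolVersion

// Opt a set that still runs the old protocol into the new one.
var cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)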

Comment by Aristarkh Zagorodnikov [ 23/Oct/13 ]

Any news on this one? We resorted to resetting all priorities to default (1), but this is not what we would like to use =)
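
For reference, the workaround described above amounts to something like this from the shell (a sketch; run it against the current primary):

// Reset every member's priority to the default so priority never conflicts with freshness.
var cfg = rs.conf()
cfg.members.forEach(function (m) { m.priority = 1; })
rs.reconfig(cfg)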

Comment by Aristarkh Zagorodnikov [ 28/Aug/13 ]

The following patch fixes things for me. I'm not sure whether relaxing the priority veto based on a possibly stale (heartbeat-derived) optime comparison is a good solution, though.

$ git diff --minimal
diff --git a/src/mongo/db/repl/consensus.cpp b/src/mongo/db/repl/consensus.cpp
index da75037..10bd928 100644
--- a/src/mongo/db/repl/consensus.cpp
+++ b/src/mongo/db/repl/consensus.cpp
@@ -72,9 +72,12 @@ namespace mongo {
 
             if (highestPriority &&
                 highestPriority->config().priority > hopeful->config().priority) {
-                errmsg = str::stream() << hopeful->fullName() << " has lower priority than " <<
-                    highestPriority->fullName();
-                return true;
+                OpTime lastTime = highestPriority->h().isSelf() ? theReplSet->lastOpTimeWritten : highestPriority->hbinfo().opTime;
+                if( lastTime >= hopeful->hbinfo().opTime ) {
+                    errmsg = str::stream() << hopeful->fullName() << " has lower priority than " <<
+                        highestPriority->fullName();
+                    return true;
+                }
             }
 
             if (!theReplSet->isElectable(id)) {
@@ -247,10 +250,20 @@ namespace mongo {
             vote = -10000;
         }
         else if( highestPriority && highestPriority->config().priority > hopeful->config().priority) {
-            log() << hopeful->fullName() << " has lower priority than " << highestPriority->fullName();
-            vote = -10000;
+            OpTime lastTime = highestPriority->h().isSelf() ? theReplSet->lastOpTimeWritten : highestPriority->hbinfo().opTime;
+            if( lastTime >= hopeful->hbinfo().opTime ) {
+                log() << hopeful->fullName() << " has lower priority than " << highestPriority->fullName();
+                vote = -10000;
+            }
+            else {
+                vote = 1;
+            }
         }
         else {
+            vote = 1;
+        }
+        if( vote > 0 )
+        {
             try {
                 vote = yea(whoid);
                 dassert( hopeful->id() == whoid );

Comment by Aristarkh Zagorodnikov [ 27/Aug/13 ]

P.P.S. If you would like better reproducibility, I believe that some kind of rate limiting (via nice or TCP shaping) might do the trick.

Comment by Aristarkh Zagorodnikov [ 27/Aug/13 ]

P.S. We originally hit this problem on shard replica sets with oplogs measured in months (54 days at the current, actually low, write volume):

driveFS-4:PRIMARY> db.oplog.rs.stats()
{
        "ns" : "local.oplog.rs",
        "count" : 7789862,
        "size" : 209582574428,
        "avgObjSize" : 26904.52981426372,
        "storageSize" : 209715202544,
        "numExtents" : 98,
        "nindexes" : 0,
        "lastExtentSize" : 1511796736,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 0,
        "totalIndexSize" : 0,
        "indexSizes" : {
 
        },
        "capped" : true,
        "max" : NumberLong("9223372036854775807"),
        "ok" : 1
}
driveFS-4:PRIMARY> (db.oplog.rs.find({},{ts:1}).sort({$natural:-1}).limit(1)[0].ts.t - db.oplog.rs.find({},{ts:1}).sort({$natural:1}).limit(1)[0].ts.t)/86400
54.050439814814816

Comment by Aristarkh Zagorodnikov [ 27/Aug/13 ]

The oplog size is irrelevant (it may affect reproducibility because of I/O factors, but it is still not the root cause). Please pay attention to the report; let me highlight the important part:

test:SECONDARY> rs.status()
{
...
{ "_id" : 1,.. "stateStr" : "SECONDARY",.. "optime" : Timestamp(1377520303, 6044),.. },
{ "_id" : 2,.. "stateStr" : "SECONDARY",.. "optime" : Timestamp(1377520303, 6045),.. },
 
],..
}

As is visible here, the last-operation timestamps are just one operation apart.
Since I left these servers running, I was able to extract the oplog info from both of them while they are still stalled as secondaries.
db1:

test:SECONDARY> db.oplog.rs.count()
32479
test:SECONDARY> db.oplog.rs.find().sort({$natural:-1}).limit(2)
{ "ts" : Timestamp(1377520303, 6044), "h" : NumberLong("6912852358546632990"), "v" : 2, "op" : "u", "ns" : "test.test", "o2" : { "_i   d" : 0 }, "o" : { "$set" : { "v" : 32478 } } }
{ "ts" : Timestamp(1377520303, 6043), "h" : NumberLong("-2121148984826622410"), "v" : 2, "op" : "u", "ns" : "test.test", "o2" : { "_   id" : 0 }, "o" : { "$set" : { "v" : 32477 } } }

db2:

test:SECONDARY> db.oplog.rs.count()
32480
test:SECONDARY> db.oplog.rs.find().sort({$natural:-1}).limit(2)
{ "ts" : Timestamp(1377520303, 6045), "h" : NumberLong("236059633665595751"), "v" : 2, "op" : "u", "ns" : "test.test", "o2" : { "_id" : 0 }, "o" : { "$set" : { "v" : 32479 } } }
{ "ts" : Timestamp(1377520303, 6044), "h" : NumberLong("6912852358546632990"), "v" : 2, "op" : "u", "ns" : "test.test", "o2" : { "_id" : 0 }, "o" : { "$set" : { "v" : 32478 } } }

Comment by Matt Dannenberg [ 27/Aug/13 ]

I tried your repro and it hung as you've described about half of the time. However, when I made the oplog two orders of magnitude larger (one order may have sufficed, but I didn't try it), I no longer encountered the failure. What I believe is happening is that the higher-priority, less-fresh member cannot catch up because its most recent oplog entry is no longer in the lower-priority member's oplog (despite being less than 10 minutes behind).
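
One way to check for that condition by hand (a sketch, along the lines of the oplog queries earlier in this ticket): compare the stalled higher-priority member's newest oplog entry with the oldest entry still present on the fresher, lower-priority member.

// On the higher-priority, less-fresh member (db1); rs.slaveOk() may be needed on a secondary.
rs.slaveOk()
db.getSiblingDB("local").oplog.rs.find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1)[0].ts   // newest entry on db1

// On the lower-priority, fresher member (db2).
rs.slaveOk()
db.getSiblingDB("local").oplog.rs.find({}, { ts: 1 }).sort({ $natural: 1 }).limit(1)[0].ts    // oldest entry still in db2's oplog

// If db1's newest ts is older than db2's oldest ts, db1's sync point has already
// rolled off db2's oplog and db1 can no longer catch up from it.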

Comment by Aristarkh Zagorodnikov [ 26/Aug/13 ]

Attached the original case's repro scripts.
