Details
Type: Task
Resolution: Fixed
Priority: Major - P3
Description
We are adding the server parameter fassertOnLockTimeoutForStepUpDown, which controls whether the server fasserts when it times out acquiring the lock for a stepUp or stepDown command. This allows a cluster to elect a new primary in rare error conditions, such as a disk failure. For more information, see the related ticket SERVER-56756.
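For illustration, a minimal sketch of inspecting and adjusting the parameter from the mongo shell, assuming it is runtime-settable via the standard setParameter/getParameter admin commands (whether it is startup-only, and whether the value is a boolean flag or a timeout, are assumptions here; check the server documentation for the actual type and default):

    // Minimal sketch, assuming fassertOnLockTimeoutForStepUpDown is
    // runtime-settable; the value 1 is illustrative only.
    db.adminCommand({ setParameter: 1, fassertOnLockTimeoutForStepUpDown: 1 })

    // Read the current value back via the standard getParameter command.
    db.adminCommand({ getParameter: 1, fassertOnLockTimeoutForStepUpDown: 1 })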
Description of Linked Ticket
Sending a stepdown request to a primary that is experiencing disk failures can consistently result in time-out errors:
    {
        "operationTime" : Timestamp(1620337238, 857),
        "ok" : 0,
        "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
        "code" : 262,
        "codeName" : "ExceededTimeLimit",
        "$gleStats" : {
            "lastOpTime" : Timestamp(0, 0),
            "electionId" : ObjectId("7fffffff0000000000000001")
        },
        "lastCommittedOpTime" : Timestamp(1620337238, 327),
        "$configServerState" : {
            "opTime" : {
                "ts" : Timestamp(1620337306, 1),
                "t" : NumberLong(1)
            }
        },
        "$clusterTime" : {
            "clusterTime" : Timestamp(1620337306, 1),
            "signature" : {
                "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                "keyId" : NumberLong(0)
            }
        }
    }
The error is returned from here, and the behavior is easy to reproduce; I've tested it on v4.0.23.
I also tried attaching GDB to the primary to collect stack traces, but GDB hangs, and I haven't found an alternative yet.
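For reference, a stepdown can be requested from the mongo shell as sketched below; on a primary with a stalled disk, this is the call that times out with ExceededTimeLimit (code 262) as shown above. The timeout values are illustrative, not taken from the ticket:

    // Ask the primary to step down for 60 seconds, waiting up to 10 seconds
    // for a secondary to catch up. On a disk-stalled primary this returns
    // code 262 (ExceededTimeLimit) instead of completing the stepdown.
    db.adminCommand({ replSetStepDown: 60, secondaryCatchUpPeriodSecs: 10 })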
Attachments
Issue Links
- documents: SERVER-56756 Primary cannot stepDown when experiencing disk failures (Closed)