[DOCS-15060] Investigate changes in SERVER-56756: Primary cannot stepDown when experiencing disk failures Created: 24/Jan/22  Updated: 13/Nov/23  Resolved: 23/Feb/22

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 5.3.0, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Jocelyn Mendez
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-56756 Primary cannot stepDown when experien... Closed
Participants:
Days since reply: 1 year, 50 weeks, 1 day ago
Epic Link: DOCSP-19447

 Description   
Downstream Change Summary

We are adding the parameter fassertOnLockTimeoutForStepUpDown, which controls whether the server fasserts if it times out acquiring the lock for a Step Up or Step Down command. This allows a cluster to elect a new primary in rare error conditions, such as a disk failure. For more information, see the related ticket SERVER-56756.
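
As a rough illustration of how the new parameter might be set at runtime, the sketch below builds a setParameter command document in Python. This ticket does not state the parameter's value type or whether it is runtime-settable; the sketch assumes an integer value (for example a timeout in seconds, with 0 disabling the fassert) and that the parameter is accepted by setParameter. The helper name build_set_parameter_command is hypothetical.

```python
def build_set_parameter_command(value: int) -> dict:
    """Build a setParameter command document for the new parameter.

    Assumption: the parameter takes an integer and can be changed at runtime.
    """
    # The command name must be the first key; Python dicts keep insertion order.
    return {"setParameter": 1, "fassertOnLockTimeoutForStepUpDown": value}


# Assumed semantics: 0 turns the fassert behavior off.
command = build_set_parameter_command(0)
print(command)
```

With a driver such as PyMongo, a document like this would typically be passed to the admin database's command() method; check the server documentation for the supported values before relying on it.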

Description of Linked Ticket

Sending a step down request to a primary that is experiencing disk failures could result in consistent time-out errors:

{
        "operationTime" : Timestamp(1620337238, 857),
        "ok" : 0,
        "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
        "code" : 262,
        "codeName" : "ExceededTimeLimit",
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000001")
        },
        "lastCommittedOpTime" : Timestamp(1620337238, 327),
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1620337306, 1),
                        "t" : NumberLong(1)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1620337306, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

The error is returned from here and the behavior is easy to reproduce. I've tested the behavior on v4.0.23.

I also tried to attach GDB to the primary to collect stack traces, but GDB hangs and I haven't found an alternative yet.
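
For anyone scripting around this failure, the time-out reply shown above can be recognized by its ok, code, and codeName fields. The following is a minimal sketch, assuming the reply has already been decoded into a Python dict; the function name is_stepdown_lock_timeout is hypothetical, and the sample reply copies only a few fields from the output above.

```python
EXCEEDED_TIME_LIMIT = 262  # error code taken from the reply shown in this ticket


def is_stepdown_lock_timeout(reply: dict) -> bool:
    """Return True if the reply looks like the step down lock time-out."""
    return (
        reply.get("ok") == 0
        and reply.get("code") == EXCEEDED_TIME_LIMIT
        and reply.get("codeName") == "ExceededTimeLimit"
    )


# Subset of the fields from the reply in the description.
sample_reply = {
    "ok": 0,
    "errmsg": "Could not acquire the global shared lock before the deadline for stepdown",
    "code": 262,
    "codeName": "ExceededTimeLimit",
}

print(is_stepdown_lock_timeout(sample_reply))
```

Matching on the numeric code and codeName rather than the errmsg text avoids depending on the exact wording of the message.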



 Comments   
Comment by Githook User [ 22/Feb/22 ]

Author:

{'name': 'jocelyn-mendez1', 'email': '91144778+jocelyn-mendez1@users.noreply.github.com', 'username': 'jocelyn-mendez1'}

Message: DOCS-15060 fassertOnLockTimeoutForStepUpDown parameter (#627)

Co-authored-by: Jocelyn Mendez <jocelyn.mendez@Jocelyns-MacBook-Pro.local>
Branch: master
https://github.com/10gen/docs-mongodb-internal/commit/38bf71fc19e78770e91e8fe40b98ca9aeae6087a

Comment by PM Bot [ 24/Jan/22 ]

Downstream changes updated for upstream SERVER-56756:
We are adding the parameter fassertOnLockTimeoutForStepUpDown, which controls whether the server fasserts if it times out acquiring the lock for a Step Up or Step Down command.
