DOCS-15060

Investigate changes in SERVER-56756: Primary cannot stepDown when experiencing disk failures

      Downstream Change Summary

      We are adding the parameter fassertOnLockTimeoutForStepUpDown, which controls whether the server fasserts if it times out acquiring the lock for a step-up or step-down command. This allows a cluster to elect a new primary in rare error conditions, such as a disk failure. For more information, see the related ticket SERVER-56756.

      Description of Linked Ticket

      Sending a step-down request to a primary that is experiencing disk failures can result in consistent timeout errors:

      {
              "operationTime" : Timestamp(1620337238, 857),
              "ok" : 0,
              "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
              "code" : 262,
              "codeName" : "ExceededTimeLimit",
              "$gleStats" : {
                      "lastOpTime" : Timestamp(0, 0),
                      "electionId" : ObjectId("7fffffff0000000000000001")
              },
              "lastCommittedOpTime" : Timestamp(1620337238, 327),
              "$configServerState" : {
                      "opTime" : {
                              "ts" : Timestamp(1620337306, 1),
                              "t" : NumberLong(1)
                      }
              },
              "$clusterTime" : {
                      "clusterTime" : Timestamp(1620337306, 1),
                      "signature" : {
                              "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                              "keyId" : NumberLong(0)
                      }
              }
      }
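
For scripts that need to tell this failure apart from other command errors, the reply above can be matched on its `code`, `codeName`, and `errmsg` fields. The following is a minimal sketch, assuming the reply has already been decoded into a plain Python dict; the helper name `is_stepdown_lock_timeout` is hypothetical and not part of any MongoDB driver.

```python
# Values copied from the sample reply above.
EXCEEDED_TIME_LIMIT_CODE = 262
EXCEEDED_TIME_LIMIT_NAME = "ExceededTimeLimit"


def is_stepdown_lock_timeout(response: dict) -> bool:
    """Return True if a reply looks like the step-down lock timeout shown above.

    Hypothetical helper: `response` is assumed to be a command reply that has
    already been decoded into a plain dict.
    """
    # A successful command has ok == 1, so it cannot be this error.
    if response.get("ok") != 0:
        return False
    return (
        response.get("code") == EXCEEDED_TIME_LIMIT_CODE
        and response.get("codeName") == EXCEEDED_TIME_LIMIT_NAME
        and "stepdown" in response.get("errmsg", "")
    )


# Trimmed copy of the reply above, keeping only the fields the check reads.
sample = {
    "ok": 0,
    "errmsg": "Could not acquire the global shared lock before the deadline for stepdown",
    "code": 262,
    "codeName": "ExceededTimeLimit",
}
print(is_stepdown_lock_timeout(sample))
```

Matching on the error message text is fragile, since the wording may differ between server versions; the `code` and `codeName` checks are the more stable part of this sketch.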
      

      The error is returned from here and the behavior is easy to reproduce. I've tested the behavior on v4.0.23.

      Also, I tried to attach GDB to the primary to collect stack-traces, but GDB hangs and I haven't been able to find an alternative yet.

            Assignee:
            jocelyn.mendez@mongodb.com Jocelyn Mendez
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team

              Created:
              Updated:
              Resolved:
              10 weeks ago