DOCS-15060

Investigate changes in SERVER-56756: Primary cannot stepDown when experiencing disk failures


Details

    Description

      Downstream Change Summary

      We are adding the parameter fassertOnLockTimeoutForStepUpDown, which controls whether the server fasserts (terminates with a fatal assertion) if it times out acquiring the lock for a step-up or step-down command. Terminating the stuck node allows the cluster to elect a new primary in rare error conditions, such as a disk failure. For more information, see the related ticket SERVER-56756.
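
A minimal sketch of how the parameter might be set at runtime from the mongo shell. It assumes the parameter is exposed through the standard setParameter admin command and takes a timeout in seconds, with 0 disabling the fassert; this summary does not state the value type or default, so both are assumptions. buildStepUpDownParam is a hypothetical helper, not part of any MongoDB API.

```javascript
// Hypothetical helper: builds a setParameter command document for
// fassertOnLockTimeoutForStepUpDown. The value semantics (seconds, with 0
// meaning "do not fassert") are an assumption, not stated in this ticket.
function buildStepUpDownParam(timeoutSecs) {
  if (!Number.isInteger(timeoutSecs) || timeoutSecs < 0) {
    throw new Error("timeoutSecs must be a non-negative integer");
  }
  return { setParameter: 1, fassertOnLockTimeoutForStepUpDown: timeoutSecs };
}

// In the mongo shell, against a running mongod:
//   db.adminCommand(buildStepUpDownParam(15));
```

Building the command document separately keeps the validation testable without a running server.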

      Description of Linked Ticket

      Sending a step-down request to a primary that is experiencing disk failures can consistently fail with a timeout error such as:

      {
              "operationTime" : Timestamp(1620337238, 857),
              "ok" : 0,
              "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
              "code" : 262,
              "codeName" : "ExceededTimeLimit",
              "$gleStats" : {
                      "lastOpTime" : Timestamp(0, 0),
                      "electionId" : ObjectId("7fffffff0000000000000001")
              },
              "lastCommittedOpTime" : Timestamp(1620337238, 327),
              "$configServerState" : {
                      "opTime" : {
                              "ts" : Timestamp(1620337306, 1),
                              "t" : NumberLong(1)
                      }
              },
              "$clusterTime" : {
                      "clusterTime" : Timestamp(1620337306, 1),
                      "signature" : {
                              "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                              "keyId" : NumberLong(0)
                      }
              }
      }
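
A small sketch of how a script could recognize this specific failure from the reply shown above. The field values checked (ok, code, codeName) are taken from that reply; isStepDownLockTimeout is a hypothetical helper name. The usage comment assumes the legacy mongo shell, where a failed command returns the error document rather than throwing.

```javascript
// Returns true when a command reply matches the stepdown lock timeout shown
// above: ok is 0 and the error is code 262 (ExceededTimeLimit).
function isStepDownLockTimeout(reply) {
  return reply.ok === 0 &&
         reply.code === 262 &&
         reply.codeName === "ExceededTimeLimit";
}

// In the legacy mongo shell (replSetStepDown takes the step-down period in seconds):
//   const reply = db.adminCommand({ replSetStepDown: 60 });
//   if (isStepDownLockTimeout(reply)) {
//     print("stepdown timed out acquiring the lock: " + reply.errmsg);
//   }
```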
      

      The error is returned from here and the behavior is easy to reproduce. I've tested the behavior on v4.0.23.

      I also tried to attach GDB to the primary to collect stack traces, but GDB hangs, and I haven't found an alternative yet.

          People

            jocelyn.mendez@mongodb.com Jocelyn Mendez
            backlog-server-pm Backlog - Core Eng Program Management Team

            Dates

              Created:
              Updated:
              Resolved:
              1 year, 50 weeks, 1 day ago