Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-49273

replSetStepDown v4.4 often fails with Unable to acquire lock '{4611686018427387905: ReplicationStateTransition, 1}' within 20000ms' milliseconds

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • ALL
    • Execution Team 2020-09-21

      Drivers use the replSetStepDown command in a few tests. Ideally the replSetStepDown should cause a reliable primary change (even better if it happens fast). Starting near the beginning of May (CDRIVER-3662 with v4.5.0-1128-g7ee768a), drivers have begun to see the replSetStepDown fail frequently on macOS and Windows. The failure looks like this:

        {
          "ok" : 0,
          "errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 5000ms. ...",
          "code" : 24,
          "codeName" : "LockTimeout",
        }
      

      While debugging these failures in the C# driver, we saw the following behavior when using {"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false} without replSetFreeze on 4.4.0-rc10-34-gcfd8b75. C# is a little unique in that it runs each replSetStepDown test twice in a row to test both async+sync codepaths. In the attached logs, the first execution of the test succeeds and the second fails because the election takes over 30 seconds to complete.

      1. At time 18:08.907 the primary receives the replSetStepDownTest request:
        {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1752","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"replSetStepDownTest","command":{"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false,"$db":"admin","lsid":{"id":{"$uuid":"51d0c38e-a100-457c-9249-4e87c426e493"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593001088,"i":9}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}},"numYields":0,"reslen":163,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1,"W":1}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1}}
        
      2. Meanwhile the secondary is not eligible to become primary yet (this will be solved by replSetFreeze):
         {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":21581,   "ctx":"conn2951","msg":"Received replSetStepUp request"}
         {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"ELECTION", "id":4615657, "ctx":"conn2951","msg":"Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I am still waiting for   stepdown period to end at 2020-06-24T13:18:29.331+01:00 (mask 0x10)"}}
         {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":21582,   "ctx":"conn2951","msg":"replSetStepUp request failed","attr":{"error":" :: caused by :: CommandFailed: Election failed."}}
        
      3. Shortly before the secondary becomes eligible, the ident dropper task begins and doesn't stop until 11 seconds later (from 18:26.157 to 18:37.369). This might be related to SERVER-48154:
         {"t":{"$date":"2020-06-24T13:18:26.157+01:00"},"s":"I",  "c":"STORAGE",  "id":22260,   "ctx":"TimestampMonitor","msg":"Removing drop-pending idents with drop timestamps before timestamp","attr":{"timestamp":{"$timestamp":{"t":1593001083,"i":9}}}}
         {"t":{"$date":"2020-06-24T13:18:26.158+01:00"},"s":"I",  "c":"STORAGE",  "id":22237,   "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"index-2384-4567462317691918719","namespace":"keyvault.datakeys","dropTimestamp":{"$timestamp":{"t":        1593001044,"i":11}}}}
        ... A lot more ident drops...
         {"t":{"$date":"2020-06-24T13:18:37.369+01:00"},"s":"I",  "c":"STORAGE",  "id":22237,   "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"collection-3663-4567462317691918719","namespace":"Tests06241216.testcollection","dropTimestamp":           {"$timestamp":{"t":1593001078,"i":6}}}}
        ...
        
      4. The secondary is unblocked and finally is able to step up:
         {"t":{"$date":"2020-06-24T13:18:39.625+01:00"},"s":"I",  "c":"REPL",     "id":21331,   "ctx":"OplogApplier-0","msg":"Transition to   primary complete; database writes are now permitted"}
        

      Log are attached: drivers-1290.zip

            Assignee:
            shane.harvey@mongodb.com Shane Harvey
            Reporter:
            shane.harvey@mongodb.com Shane Harvey
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: