[SERVER-49273] replSetStepDown v4.4 often fails with Unable to acquire lock '{4611686018427387905: ReplicationStateTransition, 1}' within 20000ms' milliseconds Created: 01/Jul/20  Updated: 07/Oct/20  Resolved: 07/Oct/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shane Harvey Assignee: Shane Harvey
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive drivers-1290.zip    
Issue Links:
Depends
Related
is related to SERVER-48154 ident dropper should periodically yie... Closed
Operating System: ALL
Sprint: Execution Team 2020-09-21
Participants:

 Description   

Drivers use the replSetStepDown command in a few tests. Ideally the replSetStepDown should cause a reliable primary change (even better if it happens fast). Starting near the beginning of May (CDRIVER-3662 with v4.5.0-1128-g7ee768a), drivers have begun to see the replSetStepDown fail frequently on macOS and Windows. The failure looks like this:

  {
    "ok" : 0,
    "errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 5000ms. ...",
    "code" : 24,
    "codeName" : "LockTimeout",
  }

While debugging these failures in the C# driver, we saw the following behavior when using {"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false} without replSetFreeze on 4.4.0-rc10-34-gcfd8b75. C# is a little unique in that it runs each replSetStepDown test twice in a row to test both async+sync codepaths. In the attached logs, the first execution of the test succeeds and the second fails because the election takes over 30 seconds to complete.

  1. At time 18:08.907 the primary receives the replSetStepDownTest request:

    {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1752","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"replSetStepDownTest","command":{"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false,"$db":"admin","lsid":{"id":{"$uuid":"51d0c38e-a100-457c-9249-4e87c426e493"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593001088,"i":9}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}},"numYields":0,"reslen":163,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1,"W":1}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1}}
    

  2. Meanwhile the secondary is not eligible to become primary yet (this will be solved by replSetFreeze):

     {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":21581,   "ctx":"conn2951","msg":"Received replSetStepUp request"}
     {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"ELECTION", "id":4615657, "ctx":"conn2951","msg":"Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I am still waiting for   stepdown period to end at 2020-06-24T13:18:29.331+01:00 (mask 0x10)"}}
     {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I",  "c":"COMMAND",  "id":21582,   "ctx":"conn2951","msg":"replSetStepUp request failed","attr":{"error":" :: caused by :: CommandFailed: Election failed."}}
    

  3. Shortly before the secondary becomes eligible, the ident dropper task begins and doesn't stop until 11 seconds later (from 18:26.157 to 18:37.369). This might be related to SERVER-48154:

     {"t":{"$date":"2020-06-24T13:18:26.157+01:00"},"s":"I",  "c":"STORAGE",  "id":22260,   "ctx":"TimestampMonitor","msg":"Removing drop-pending idents with drop timestamps before timestamp","attr":{"timestamp":{"$timestamp":{"t":1593001083,"i":9}}}}
     {"t":{"$date":"2020-06-24T13:18:26.158+01:00"},"s":"I",  "c":"STORAGE",  "id":22237,   "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"index-2384-4567462317691918719","namespace":"keyvault.datakeys","dropTimestamp":{"$timestamp":{"t":        1593001044,"i":11}}}}
    ... A lot more ident drops...
     {"t":{"$date":"2020-06-24T13:18:37.369+01:00"},"s":"I",  "c":"STORAGE",  "id":22237,   "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"collection-3663-4567462317691918719","namespace":"Tests06241216.testcollection","dropTimestamp":           {"$timestamp":{"t":1593001078,"i":6}}}}
    ...
    

  4. The secondary is unblocked and finally is able to step up:

     {"t":{"$date":"2020-06-24T13:18:39.625+01:00"},"s":"I",  "c":"REPL",     "id":21331,   "ctx":"OplogApplier-0","msg":"Transition to   primary complete; database writes are now permitted"}
    

Log are attached: drivers-1290.zip



 Comments   
Comment by Eric Milkie [ 07/Oct/20 ]

Great, thank you! The slow elections on Windows are odd and unexpected, but unlikely to be related to this particular ticket work.

Comment by Shane Harvey [ 06/Oct/20 ]

I no longer see LockTimeout errors when running replSetStepDown. However, when I remove the replSetFreeze:0 added in DRIVERS-1290 I do still see slow elections on Windows. By "slow elections" I mean that even after the replSetStepDown command succeeds, it still takes >15 seconds for the new primary to take over. Running replSetFreeze:0 on a secondary before replSetStepDown fixes this issue and ensures a speedy election.

I'm not sure if the slow elections are expected but I think we can close this issue.

Comment by Eric Milkie [ 28/Sep/20 ]

shane.harvey we backported SERVER-48154 to the 4.4 branch on Sept 11; server builds from the 4.4 branch after that date should no longer exhibit this issue. Can you investigate if the failures have stopped? I think we can unblock DRIVERS-1290 now.

Comment by Eric Milkie [ 09/Sep/20 ]

We're going to backport SERVER-48154 to 4.4 to see if that clears up this issue.

Generated at Thu Feb 08 05:19:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.