[SERVER-48154] ident dropper should periodically yield Global IS lock Created: 12/May/20  Updated: 29/Oct/23  Resolved: 30/Jul/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.7.0, 4.4.2

Type: Improvement
Priority: Major - P3
Reporter: Eric Milkie
Assignee: Gregory Wlodarek
Resolution: Fixed
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-49273 replSetStepDown v4.4 often fails with... Closed
Backwards Compatibility: Fully Compatible
Backport Requested: v4.4
Sprint: Execution Team 2020-08-10
Participants:
Linked BF Score: 8

 Description   

If there are hundreds of idents to drop, the dropper can take 10 seconds or more, during which time stepdown and shutdown are blocked.
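The requested behavior can be illustrated with a minimal sketch. This is not the server's implementation (the server is C++ and the Global IS lock is a shared intent lock); here a plain threading.Lock stands in for it, and drop_idents_with_yield, drop_ident and yield_every are hypothetical names chosen for illustration. The only point is the shape of the loop: hold the lock for one bounded batch, release it, then reacquire it for the next batch.

```python
import threading
from typing import Callable, Iterable, List


def drop_idents_with_yield(
    idents: Iterable[str],
    drop_ident: Callable[[str], None],
    global_lock: threading.Lock,
    yield_every: int = 10,
) -> int:
    """Drop idents, releasing the lock after every `yield_every` drops.

    Returns how many times the lock was acquired. All names here are
    illustrative; none of them come from the MongoDB code base.
    """
    if yield_every < 1:
        raise ValueError("yield_every must be at least 1")

    pending: List[str] = list(idents)
    acquisitions = 0
    for start in range(0, len(pending), yield_every):
        batch = pending[start:start + yield_every]
        # Hold the lock for one batch only. Leaving the `with` block
        # releases it, which gives a waiting stepdown or shutdown a
        # chance to acquire it before the next batch starts.
        with global_lock:
            acquisitions += 1
            for ident in batch:
                drop_ident(ident)
    return acquisitions
```

With 25 idents and yield_every=10 the lock is taken three times instead of being held once for the whole run. Note that threading.Lock makes no fairness guarantee, so this sketch shows the yielding pattern only, not how the server schedules lock waiters.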



 Comments   
Comment by Githook User [ 11/Sep/20 ]

Author: Gregory Wlodarek <gregory.wlodarek@mongodb.com> (GWlodarek)

Message: SERVER-48154 Ident reaper should periodically yield the global lock

(cherry picked from commit 34048207ba57b626063d3d940eb5f3ed65203039)
Branch: v4.4
https://github.com/mongodb/mongo/commit/425ca4de28d442a4d636fa55b1773567a0885d03

Comment by Githook User [ 30/Jul/20 ]

Author: Gregory Wlodarek <gregory.wlodarek@mongodb.com> (GWlodarek)

Message: SERVER-48154 Ident reaper should periodically yield the global lock
Branch: master
https://github.com/mongodb/mongo/commit/34048207ba57b626063d3d940eb5f3ed65203039

Comment by Eric Milkie [ 23/Jun/20 ]

shane.harvey, you can raise the timeout by setting a longer secondaryCatchUpPeriodSecs value. In SERVER-48107 that wasn't an option because force:true was being used. I don't think this ticket is generally the cause of the timeouts in the driver tests.
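As a rough sketch of the suggestion above, a test helper could build the replSetStepDown command document with an explicit secondaryCatchUpPeriodSecs. The helper name build_step_down_command and the example values are assumptions made for illustration; the two field names are the documented options of replSetStepDown, and the step-down period has to be longer than the catch-up period.

```python
from typing import Any, Dict


def build_step_down_command(
    step_down_secs: int = 60,
    catch_up_secs: int = 10,
) -> Dict[str, Any]:
    """Build a replSetStepDown command document (illustrative helper).

    The command name is inserted first because the server reads the
    first key of the document as the command name.
    """
    if catch_up_secs >= step_down_secs:
        # replSetStepDown requires the step-down period to exceed
        # secondaryCatchUpPeriodSecs.
        raise ValueError("step_down_secs must be greater than catch_up_secs")
    return {
        "replSetStepDown": step_down_secs,
        "secondaryCatchUpPeriodSecs": catch_up_secs,
    }
```

With PyMongo, for example, the resulting document could be passed to client.admin.command(...). As the comment notes, this does not apply when force:true is used.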

Comment by Dmitry Lukyanov (Inactive) [ 23/Jun/20 ]

See this failed test for details: https://evergreen.mongodb.com/task_log_raw/dot_net_driver_unsecure_tests__version~4.4_os~windows_64_topology~replicaset_auth~noauth_ssl~nossl_test_netstandard15_patch_297fcd723ff32aac47b7b018c978ec4baf0773d7_5ef14d1b32f4170a354b8a5a_20_06_23_00_30_45/0?type=T#L1721
The retry logic was invoked 9 times and the command still failed.

Comment by Shane Harvey [ 23/Jun/20 ]

My understanding is that this issue causes the replSetStepDown command to frequently fail with the following transient error (see DRIVERS-1290 and SERVER-48107):

  {
    "ok" : 0,
    "errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 1000ms. opId: 922, op: conn30, connId: 30.",
    "code" : 24,
    "codeName" : "LockTimeout"
  }

So far I've only seen this error on 4.4 and 4.5-latest, and only on macOS and Windows. Moreover, even with a 10-second retry loop, some drivers (C#) are still seeing this error on Windows. Is there any way to prevent this from happening? Is there a command drivers can run directly before replSetStepDown that would wait for the "idents" to drop?

CC: dmitry.lukyanov
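For reference, the kind of retry loop described in this comment might look roughly like the sketch below. It is not any driver's actual test code: CommandError is a hypothetical stand-in for a driver's command-failure exception, run_command is whatever callable sends the command, and the 10-second deadline and 0.5-second pause are assumed values. Only code 24 (LockTimeout), taken from the error document above, is treated as transient.

```python
import time
from typing import Any, Callable, Dict

LOCK_TIMEOUT_CODE = 24  # "LockTimeout", as in the error document above


class CommandError(Exception):
    """Hypothetical stand-in for a driver's command-failure exception."""

    def __init__(self, code: int, errmsg: str = "") -> None:
        super().__init__(errmsg)
        self.code = code


def step_down_with_retry(
    run_command: Callable[[Dict[str, Any]], Any],
    command: Dict[str, Any],
    deadline_secs: float = 10.0,
    pause_secs: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command`, retrying on LockTimeout until the deadline passes.

    Returns the number of attempts made. Any other error code, or a
    LockTimeout after the deadline, is re-raised to the caller.
    """
    deadline = time.monotonic() + deadline_secs
    attempts = 0
    while True:
        attempts += 1
        try:
            run_command(command)
            return attempts
        except CommandError as exc:
            # Only the transient LockTimeout is worth retrying.
            if exc.code != LOCK_TIMEOUT_CODE or time.monotonic() >= deadline:
                raise
            sleep(pause_secs)
```

A loop like this only works around the symptom. As reported above, it can still exhaust its deadline while the ident dropper holds the Global IS lock.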

Comment by Gregory Wlodarek [ 15/Jun/20 ]

We obtain the Global IS lock here.
