[SERVER-36315] After stepdown the CSRS dist lock manager keeps trying to unlock locks Created: 26/Jul/18  Updated: 29/Oct/23  Resolved: 26/Sep/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.16, 3.6.6, 4.0.0
Fix Version/s: 3.6.15, 4.0.13, 4.2.1, 4.3.1, 3.4.24

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Kevin Pulo
Resolution: Fixed Votes: 2
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6, v3.4
Sprint: Sharding 2019-07-29, Sharding 2019-08-12, Sharding 2019-08-26, Sharding 2019-10-07
Participants:
Case:

 Description   

It looks like after stepdown of the config server primary, if the dist lock manager had any locks which it held (or failed to unlock) while it was a primary, it will keep trying to unlock them even though it is not a primary anymore. This pollutes the logs with these messages:

2018-07-25T13:32:29.635-0400 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 5b45d865f11548390b954ea2 and _id: config-movePrimary :: caused by :: NotMaster: Not primary while running findAndModify command on collection config.locks



 Comments   
Comment by Githook User [ 26/Sep/19 ]

Author:

{'username': 'devkev', 'email': 'kevin.pulo@mongodb.com', 'name': 'Kevin Pulo'}

Message: SERVER-36315 Don't keep trying to unlock distlocks after CSRS stepdown

(cherry picked from commit 3197ace779d6e0f023bff9e39bd80e17e3c4790d)
Branch: v4.2
https://github.com/mongodb/mongo/commit/21381250dede7f387ffe41a6ccba71a498492831

Comment by Githook User [ 26/Sep/19 ]

Author:

{'username': 'devkev', 'email': 'kevin.pulo@mongodb.com', 'name': 'Kevin Pulo'}

Message: SERVER-36315 Don't keep trying to unlock distlocks after CSRS stepdown

(cherry picked from commit 3197ace779d6e0f023bff9e39bd80e17e3c4790d)
Branch: v4.0
https://github.com/mongodb/mongo/commit/5c5088b0a184ed6993735ffdb37de5025b57a456

Comment by Githook User [ 26/Sep/19 ]

Author:

{'name': 'Kevin Pulo', 'username': 'devkev', 'email': 'kevin.pulo@mongodb.com'}

Message: SERVER-36315 Don't keep trying to unlock distlocks after CSRS stepdown

(cherry picked from commit 3197ace779d6e0f023bff9e39bd80e17e3c4790d)
Branch: v3.6
https://github.com/mongodb/mongo/commit/6d190e48080930ddaf1a4f72bda86a47e5e3dc2a

Comment by Githook User [ 26/Sep/19 ]

Author:

{'name': 'Kevin Pulo', 'username': 'devkev', 'email': 'kevin.pulo@mongodb.com'}

Message: SERVER-36315 Don't keep trying to unlock distlocks after CSRS stepdown

(cherry picked from commit 3197ace779d6e0f023bff9e39bd80e17e3c4790d)
Branch: v3.4
https://github.com/mongodb/mongo/commit/a2343b191e3a0edf93cc7e9f22b8434d1f2a7c0d

Comment by Githook User [ 26/Sep/19 ]

Author:

{'username': 'devkev', 'email': 'kevin.pulo@mongodb.com', 'name': 'Kevin Pulo'}

Message: SERVER-36315 Don't keep trying to unlock distlocks after CSRS stepdown
Branch: master
https://github.com/mongodb/mongo/commit/3197ace779d6e0f023bff9e39bd80e17e3c4790d

Comment by Scott Glajch [ 05/Aug/19 ]

2019-08-05T20:53:22.417+0000 I CONTROL  [thread58] Failed to create config.system.sessions: Not primary while running findAndModify command on collection config.locks, will try again at the next refresh interval
2019-08-05T20:53:22.417+0000 I CONTROL  [thread58] Sessions collection is not set up; waiting until next sessions refresh interval: Not primary while running findAndModify command on collection config.locks
2019-08-05T20:53:22.489+0000 W SHARDING [replSetDistLockPinger] Failed to unlock lock with ts: 5d489742c286d3f0f6acfe6d and _id: config-movePrimary :: caused by :: NotMaster: Not primary while running findAndModify command on collection config.locks

Those are some related log messages, if that helps.  It seems every time that (thread58) message happens, it adds another duplicate log message per interval.

Comment by Scott Glajch [ 05/Aug/19 ]

Actually I spoke too soon, the message is still showing up, though only on the first replica that we restarted and only at 1 message per minute (instead of at 30K messages per minute).

Comment by Scott Glajch [ 05/Aug/19 ]

We just had this happen to us.  After not knowing exactly how to get these messages to stop on the 2 secondary config server nodes, we ended up just re-initial syncing the 2 replicas, since luckily the config servers don't house much data at all and that only took a few minutes to complete.
Do you have any better workaround steps for anyone else who hits this issue for now?

Comment by Kaloian Manassiev [ 24/Jun/19 ]

I do think this bug is a pretty bad experience, because it is 100% guaranteed that upon step-down of a CSRS primary, the secondaries will keep trying to unlock locks and with more and more locks that we put on the CSRS, this will keep filling the logs.

We should be able to register it with the ReplicationCoordinatorExternalState and mark it as stepped up/stepped down on replication state transitions (only for the CSRS). This is still not 100% race-condition proof, but at the very least it will not try to unlock any more locks on secondaries.

Putting it back in needs triage.

Generated at Thu Feb 08 04:42:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.