[SERVER-3039] distributed lock needs to be checked periodically Created: 04/May/11  Updated: 06/Dec/22  Resolved: 31/May/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.9.0
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Greg Studer Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

If a process holding the distributed lock takes longer than a few minutes to complete, it is possible that the config servers could go unresponsive during the operation, come back up, and be queried by other processes before the still-running holder has a chance to ping again, leading to an incorrect forced takeover. The easiest way to avoid this is probably to re-acquire the distributed lock every 5-7 minutes or so in long-running processes (half the timeout)?
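A minimal sketch of the proposed mitigation, in Python with invented names (the actual server code is C++ and uses the config.locks / config.lockpings collections): a long-running holder re-pings at half the takeover timeout, so another process never observes a ping stale enough to justify forcing the lock. A deterministic fake clock stands in for real time.

```python
import time

TAKEOVER_TIMEOUT = 15 * 60              # ping older than this allows a forced takeover
REPING_INTERVAL = TAKEOVER_TIMEOUT / 2  # re-ping at half the timeout, as proposed


class FakeClock:
    """Deterministic stand-in for time.monotonic, for the demo below."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds


class DistLock:
    """Hypothetical in-memory stand-in for the distributed lock state."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_ping = clock()

    def ping(self):
        # Real code would write a ping entry to the config servers.
        self.last_ping = self.clock()

    def can_be_forced(self):
        # Another process may only force the lock once the holder's ping is stale.
        return self.clock() - self.last_ping > TAKEOVER_TIMEOUT


# Demo: an hour-long job that re-pings whenever half the timeout has elapsed.
clock = FakeClock()
lock = DistLock(clock=clock)
last_reping = clock()
ever_forceable = False
for _ in range(12):                     # 12 x 5-minute work steps = 1 hour
    clock.advance(5 * 60)               # ... one chunk of real work here ...
    if clock() - last_reping >= REPING_INTERVAL:
        lock.ping()
        last_reping = clock()
    ever_forceable = ever_forceable or lock.can_be_forced()

# Counter-case: a holder that never re-pings becomes forceable after the timeout.
silent = DistLock(clock=clock)
clock.advance(20 * 60)
silent_forceable = silent.can_be_forced()
```

The re-pinging holder is never eligible for a forced takeover, while the silent one is after 20 minutes.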



 Comments   
Comment by Ratika Gandhi [ 31/May/19 ]

We have largely moved away from distributed locks.

Comment by Greg Studer [ 02/May/14 ]

This has become less of an issue since 2.4, where we re-check the distributed lock state after the long-running migration data transfer. All other metadata operations should take much less time than 30 secs.

Comment by Greg Studer [ 18/May/11 ]

Another case - lock pinging fails, but this does not impact the acquisition of a dist lock - the two are separate. We need to disallow acquisition if the last lock ping failed, or add some similar safeguard.

Comment by auto [ 12/May/11 ]

Author: gregstuder (gregs) &lt;greg@10gen.com&gt;

Message: don't remember pings on errors or successful dist_locking SERVER-3039
Branch: master
https://github.com/mongodb/mongo/commit/fcbcbdac954522f7af2cd804bf06ef1701ad9b42

Comment by Greg Studer [ 09/May/11 ]

agreed - I was thinking the simplest way of implementing this is just an additional method on dist_lock_try (retry()) that you'd call periodically to check this. Right now it actually sort of does this, in that the retry() call fails if we can't read and write to every config server, but it doesn't explicitly check the lock pings. That explicit check might be unnecessary - if our config servers are up and we're able to read and write to config.locks, writes to config.lockpings should always get through too; if not, it's a program error.
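The retry() idea might look roughly like this Python sketch (all names invented; dist_lock_try itself is C++): retry() fails not only when pinging the config servers fails, but also when the last successful ping is older than the takeover window, so a long-running holder finds out it has lost protection instead of assuming it.

```python
import time

PING_STALE_AFTER = 15 * 60  # a ping older than this no longer protects the lock


class FakeClock:
    """Deterministic stand-in for time.monotonic, for the demo below."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds


class DistLockTry:
    """Hypothetical sketch of dist_lock_try with an explicit ping-freshness check."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_successful_ping = clock()

    def _ping_config_servers(self):
        # Stand-in for writing the ping entry; a real call can raise on network error.
        return True

    def retry(self):
        """Re-verify that we still safely hold the lock."""
        try:
            ok = self._ping_config_servers()
        except IOError:
            ok = False
        if ok:
            self.last_successful_ping = self.clock()
        # Check ping freshness explicitly, not just config-server reachability.
        return self.clock() - self.last_successful_ping < PING_STALE_AFTER


# Demo: a holder whose pings keep succeeding stays protected...
clock = FakeClock()
healthy = DistLockTry(clock=clock)
clock.advance(10 * 60)
still_held = healthy.retry()

# ...while one whose pings fail loses protection once the window elapses.
class BrokenPinger(DistLockTry):
    def _ping_config_servers(self):
        raise IOError("config server unreachable")

clock2 = FakeClock()
broken = BrokenPinger(clock=clock2)
clock2.advance(20 * 60)
lost = broken.retry()
```

The periodic retry() call gives the long-running operation a clear point at which to abort if it can no longer prove it holds the lock.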

Comment by Eliot Horowitz (Inactive) [ 09/May/11 ]

One option is that if the lock pinger can't ping, we try to abort.
That might be the safest.

Comment by Greg Studer [ 09/May/11 ]

I'm thinking about that, but the issue there is that a network partition could mean one mongos process does not see the config server go down while another process does - the lock pinger would throw errors, but nothing would stop any operations. The solution above should work, assuming the ping on the lock is checked, because the check can ensure there is a new ping entry that will protect the lock for another 15 minutes, and abort if not (though any network error should also reset the last detected lock ping time - which currently is not the case).
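From the would-be forcer's side, the check described here could be sketched like this (Python, invented names): before forcing, re-read the holder's ping entry; a changed ping resets the observation window, and - per the fix suggested above - a network error resets it too, rather than leaving a stale observation that could justify a takeover.

```python
import time

TAKEOVER_TIMEOUT = 15 * 60  # a fresh ping protects the lock for another 15 minutes


class FakeClock:
    """Deterministic stand-in for time.monotonic, for the demo below."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds


class TakeoverMonitor:
    """Hypothetical sketch of the forcing check on the process trying to take over."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_seen_ping = None   # last ping value read for the lock holder
        self.first_seen_at = None    # when we first observed that value

    def observe(self, ping_value=None, network_error=False):
        """Record one read of the holder's ping entry."""
        if network_error or ping_value != self.last_seen_ping:
            # A new ping, or any network error, resets the window (the comment
            # notes the error case is currently *not* reset; this sketch does).
            self.last_seen_ping = None if network_error else ping_value
            self.first_seen_at = self.clock()

    def may_force(self):
        return (self.first_seen_at is not None
                and self.clock() - self.first_seen_at > TAKEOVER_TIMEOUT)


# Demo: a holder whose ping never changes for 20 minutes may be forced...
clock = FakeClock()
mon = TakeoverMonitor(clock=clock)
mon.observe(ping_value=1)
clock.advance(20 * 60)
mon.observe(ping_value=1)
stale_can_force = mon.may_force()

# ...but a new ping halfway through restarts the 15-minute window.
clock2 = FakeClock()
mon2 = TakeoverMonitor(clock=clock2)
mon2.observe(ping_value=1)
clock2.advance(10 * 60)
mon2.observe(ping_value=2)
clock2.advance(10 * 60)
fresh_can_force = mon2.may_force()
```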

Comment by Eliot Horowitz (Inactive) [ 09/May/11 ]

Wouldn't that solution have the same race condition?

Maybe after a network failure, we disallow takeovers for 5 minutes?

Generated at Thu Feb 08 03:01:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.