[SERVER-3039] distributed lock needs to be checked periodically Created: 04/May/11 Updated: 06/Dec/22 Resolved: 31/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 1.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Greg Studer | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Participants: | |
| Description |
|
If processes using the distributed lock take longer than a few minutes to complete, it is possible for the config servers to go unresponsive during the process, come back up, and be queried by other processes before the still-running process has a chance to ping again, leading to an incorrect forcing of the lock. I think the easiest way to avoid this is to re-acquire the distributed lock every 5-7 minutes or so in long-running processes (half the timeout time)? |
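For illustration only, a minimal sketch of the periodic re-check suggested above (the names `relockDistributedLock`, `kTakeoverTimeout`, and the 15-minute value are assumptions, not the server's actual code): a long-running operation re-verifies the lock at half the takeover timeout so a config server bounce cannot silently cost it the lock.

```cpp
#include <chrono>

// Hypothetical sketch only - not the server's actual locking code.
// Assumption: another process may force the lock once it sees no ping
// for the full takeover timeout (15 minutes here).
constexpr std::chrono::minutes kTakeoverTimeout{15};
constexpr std::chrono::minutes kRecheckInterval{kTakeoverTimeout / 2};  // the "5-7 mins or so"

// Stand-in for re-reading config.locks and refreshing config.lockpings.
bool relockDistributedLock() {
    // ... verify we still hold the lock on the config servers and re-ping ...
    return true;
}

// Long-running work (e.g. a migration) re-checks the lock periodically.
bool longRunningOperation(int workUnits) {
    auto lastCheck = std::chrono::steady_clock::now();
    for (int i = 0; i < workUnits; ++i) {
        // ... one unit of the long-running work ...
        if (std::chrono::steady_clock::now() - lastCheck >= kRecheckInterval) {
            if (!relockDistributedLock()) {
                return false;  // abort rather than continue without the lock
            }
            lastCheck = std::chrono::steady_clock::now();
        }
    }
    return true;
}
```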
| Comments |
| Comment by Ratika Gandhi [ 31/May/19 ] |
|
We have largely moved away from distributed locks. |
| Comment by Greg Studer [ 02/May/14 ] |
|
This has become less of an issue since 2.4, where we re-check the distributed lock state after the long-running migration data transfer. All other metadata operations should take much less time than 30 secs. |
| Comment by Greg Studer [ 18/May/11 ] |
|
Another case: lock pinging fails, but this does not impact the acquisition of a dist lock - it's separate. We need to disallow acquisition if the last lock ping failed, or add some similar logic. |
| Comment by auto [ 12/May/11 ] |
|
Author: gregstuder (gregs, greg@10gen.com)
Message: don't remember pings on errors or successful dist_locking |
| Comment by Greg Studer [ 09/May/11 ] |
|
Agreed - I was thinking the simplest way of implementing this was just an additional method in dist_lock_try (retry()) you'd call periodically to check that. Right now it actually sort of does this, in that if we can't read and write to every config server, the retry() call fails, but it doesn't explicitly check the lock pings. That might be unnecessary, though - if our config servers are up and we're able to read and write from config.locks, writing to config.lockpings should always get through too; if not, it's a program error. |
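As an illustration of the idea (this is not the real dist_lock_try API; `LockCheckState`, `stillSafelyHeld`, and the timeout value are hypothetical), such a retry()-style check could fail both when the config servers are unreachable and when the last recorded ping is stale enough that another process could legitimately force the lock:

```cpp
#include <chrono>
#include <string>

// Hypothetical types - not the server's dist_lock_try.
using Clock = std::chrono::system_clock;

struct LockCheckState {
    std::string processId;              // who we think holds the lock
    Clock::time_point lastPingWritten;  // last successful write to config.lockpings
    bool configReachable;               // could we read/write config.locks just now?
};

constexpr std::chrono::minutes kTakeoverTimeout{15};

// True only while it is still safe to act under the distributed lock.
bool stillSafelyHeld(const LockCheckState& state) {
    if (!state.configReachable) {
        return false;  // can't confirm anything about the lock
    }
    // Explicitly check the ping, not just connectivity.
    return (Clock::now() - state.lastPingWritten) < kTakeoverTimeout;
}
```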
| Comment by Eliot Horowitz (Inactive) [ 09/May/11 ] |
|
One option is that if the lock pinger can't ping, we try to abort. |
| Comment by Greg Studer [ 09/May/11 ] |
|
I was thinking about that, but the issue there seems to be that there could be network segmentation, so one mongos process does not see the config server go down while the other process does - the lock pinger would throw errors, but nothing would stop any operations. The solution above should work, assuming the ping on the lock is checked, because the check can ensure there is a new ping entry that will protect the lock for another 15 minutes, and abort if not (though any network error should also reset the last detected lock ping time - which currently is not the case). |
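A small hypothetical sketch of that parenthetical point (the `PingTracker` type is an assumption, not server code): a network error while pinging clears the locally remembered "last good ping" time, so a later lock check cannot be satisfied by a stale timestamp.

```cpp
#include <chrono>
#include <optional>

// Hypothetical sketch only.
struct PingTracker {
    std::optional<std::chrono::system_clock::time_point> lastGoodPing;

    void onPingResult(bool succeeded) {
        if (succeeded) {
            lastGoodPing = std::chrono::system_clock::now();
        } else {
            lastGoodPing.reset();  // forget the old ping on any network error
        }
    }
};
```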
| Comment by Eliot Horowitz (Inactive) [ 09/May/11 ] |
|
Wouldn't that solution have the same race condition? Maybe after a network failure, we disallow takeovers for 5 minutes? |