[SERVER-6302] Race condition when multiple fsyncLock() invocations made on a mongod instance. Created: 04/Jul/12 Updated: 17/May/19 Resolved: 02/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.1.2 |
| Fix Version/s: | 3.4.0-rc3 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Tyler Brock | Assignee: | James Wahlin |
| Resolution: | Done | Votes: | 4 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||
| Sprint: | Integration 2016-10-31, Query 2016-11-21 | ||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||||||||||
| Description |
|
There is currently no specification for multiple agents attempting to fsyncLock() a mongod instance. However, the observed behavior is somewhat confusing. Observed behavior, from a single mongo shell:
At first blush, this appears like the fsyncLock is a shared, recursive resource, and every acquisition must be paired with a release. However, this is misleading. What in fact happens is that the second acquisition returns, reporting success, if the first acquirer has the lock, and in the background on mongod, the lock request gets queued up. When the first acquisition is released with fsyncUnlock(), the pending lock request gets processed, and the fsync lock is re-acquired. There is a brief period where the fsync lock is unheld. We should:
|
| Comments |
| Comment by J Rassi [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
david.golden and I spoke in person. While we disagree about the exact standard that server changes should be held to in order to flag to downstream consumers in general, we do both agree that the standard for making unflagged user-visible changes should be much higher during the server RC process than during the normal development cycle. Most drivers have shipped their 3.4-ready release candidates already, and there's a real time/stress cost to doing last-minute releases to work around such changes from the server. | ||||||||||||||||||||||||||||||||||||||
| Comment by James Wahlin [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
|
I will send out a notification email. There is an addition made under this ticket that will allow for better testing as well. We added a "lockCount" field to both fsyncLock and fsyncUnlock response messages, which allows for evaluation of lock state in the presence of multiple clients performing locks & unlocks. The following demonstrates fsyncLock and fsyncUnlock response changes between MongoDB 3.2 and 3.4. 3.2.10
For 3.4.0-rc3
| ||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
|
While it's nice to think that the "answer" is for drivers not to depend on fields other than "ok" and "code", end users may also be depending on specific messages from the server, particularly errmsg, either in application code or in log monitoring. Anything the server sends out is effectively part of its API and changes should be judicious, really necessary and flagged. For reference, consider the Perl interpreter itself (which I'm just a bit familiar with). Release notes always include a list of changes to diagnostics because the maintainers have learned over decades that users always wind up looking for specific output strings (even if they shouldn't) and complain when their code breaks. If the volume of uassert changes is high, maybe batching them is the answer. | ||||||||||||||||||||||||||||||||||||||
| Comment by J Rassi [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
|
Personally, I don't agree that "Driver Changes: Maybe" and/or emails to downstream-changes should happen for tickets which don't affect the "code" or "ok" fields for a command response. Consider the scenario where an email to downstream-changes were sent for every ticket resolved which contained a commit that changes the value of "errmsg" in some command response. There are >1000 calls to the uassert() macro currently in master in the server respository; almost every uassert() call contains a unique error message, and I believe that >95% of these calls can bubble up to the command execution framework (causing the verbatim error message to be sent back to the user). I believe that hundreds of these calls are addded and removed in every server release, and, if such an email were sent for each one, the downstream-changes list would turn into a spam list. Furthermore, the quality of informational and error messages in the server certainly needs improvement, and I would love to keep the barrier low for server developers to improve them. Instead, I think that drivers should not depend on fields from the command response outside of "ok" and "code", except in rare circumstances. | ||||||||||||||||||||||||||||||||||||||
| Comment by David Golden [ 04/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
|
This commit broke tests for all versions of the Perl driver by changing the response message for "unlock completed". Please make sure that changes to database responses get flagged at least "Driver Changes Maybe" and notify the "downstream-changes" user list. | ||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 02/Nov/16 ] | ||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'jameswahlin', u'name': u'James Wahlin', u'email': u'james.wahlin@10gen.com'}Message: | ||||||||||||||||||||||||||||||||||||||
| Comment by Hiroaki [ 11/Sep/14 ] | ||||||||||||||||||||||||||||||||||||||
|
It seems like, this is the most important issue for the database product around safety and stability. | ||||||||||||||||||||||||||||||||||||||
| Comment by Thomas Rueckstiess [ 21/Jun/13 ] | ||||||||||||||||||||||||||||||||||||||
|
As a work-around for automated scripts, the currentOp can be inspected if the node is already locked, for example:
Note that in the unlocked case, $locked is undefined, not false. One should not proceed if the lock is already taken, and perhaps wait/retry after some time. |