[SERVER-41463] repairDatabasesAndCheckVersion should not hold a lock while calling on StorageRepairObserver::onRepairDone Created: 03/Jun/19 Updated: 09/Oct/19 Resolved: 09/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Evgeni Dobranov |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | execution_intern_2019 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Sprint: | Execution Team 2019-10-21 | ||||||||
| Participants: | |||||||||
| Description |
|
StorageRepairObserver::onRepairDone calls StorageRepairObserver::_invalidateReplConfigIfNeeded , which writes out a document and then calls waitUntilDurable(). We must hold a lock for the data write, but should not for the waitUntilDurable() call, as we are trying to add an invariant against holding locks for a potentially long I/O call in waitUntilDurable() in repairDatabasesAndCheckVersion currently holds a global exclusive lock across all of its work. Perhaps elevate the waitUntilDurable() call to the repairDatabasesAndCheckVersion level, or even a level higher, somehow, outside the lock or while the lock is momentarily released. |
| Comments |
| Comment by Dianna Hohensee (Inactive) [ 09/Oct/19 ] | ||||||||||||||||
|
That sounds reasonable to me – the original solution was quite messy. Can you add a comment to | ||||||||||||||||
| Comment by Evgeni Dobranov [ 08/Oct/19 ] | ||||||||||||||||
|
louis.williams and I talked about the invariant aimed for by Our first thought was to just remove this call to waitUntilDurable(), but the program could theoretically fail right here. If we did that, we could get a state where the "repair incomplete file" is removed but the journal is never flushed. This would make it seem like repair succeeded when it didn't, and the replica set node would have no indication that repair ran and may attempt to rejoin the set. So removing this call to waitUntilDurable() is ruled out. We think it'd be best and simplest to add a special case for repair when adding the invariant forĀ
| ||||||||||||||||
| Comment by Dianna Hohensee (Inactive) [ 01/Aug/19 ] | ||||||||||||||||
|
I looked into this a bit. A question that needs to be answered is whether we need to ensure that the StorageRepairObserver::onRepairDone reaches disk if we throw/error in the code that follows the that function call. I do not think it is safe to assume that none of the following code will throw: not future proof even if it turns out to be safe now. That answer informs whether we can add a block around the global lock acquisition and then call waitUntilDurable() after everything, if successful.
Alternatively, if it is safe to call waitUntilDurable after any error in repairDatabasesAndCheckVersion, we could avoid the lock scope by doing
Otherwise we are left with the solution of just making GlobalLock into boost::optional<GlobalLock> and releasing and reacquiring it around calling onRepairDone() and waitUntilDurable(), which would be elevated to the repairDatabasesAndCheckVersion function level. |