[SERVER-41463] repairDatabasesAndCheckVersion should not hold a lock while calling on StorageRepairObserver::onRepairDone Created: 03/Jun/19  Updated: 09/Oct/19  Resolved: 09/Oct/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Evgeni Dobranov
Resolution: Won't Fix Votes: 0
Labels: execution_intern_2019
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends: is depended on by SERVER-39591 "RecoveryUnit::waitUntilDurable() shou..." (Closed)
Sprint: Execution Team 2019-10-21
Participants:

 Description   

StorageRepairObserver::onRepairDone calls StorageRepairObserver::_invalidateReplConfigIfNeeded, which writes out a document and then calls waitUntilDurable(). We must hold a lock for the data write, but should not hold one for the waitUntilDurable() call, because SERVER-39591 aims to add an invariant against holding locks across the potentially long I/O in waitUntilDurable().

repairDatabasesAndCheckVersion currently holds a global exclusive lock across all of its work. One option is to elevate the waitUntilDurable() call to the repairDatabasesAndCheckVersion level, or even a level higher, so that it runs outside the lock or while the lock is momentarily released. A rough sketch of the pattern at issue follows.
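
For context, a minimal sketch of the pattern at issue; the lock type and surrounding structure here are assumptions for illustration, not the actual implementation:

void repairDatabasesAndCheckVersion(OperationContext* opCtx) {
    Lock::GlobalWrite lk(opCtx);  // global exclusive lock held across everything below

    // ... repair work ...
    // StorageRepairObserver::onRepairDone() eventually calls
    // _invalidateReplConfigIfNeeded(), which writes a document (needs the
    // lock) and then calls waitUntilDurable() (should not hold the lock,
    // per SERVER-39591).
}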



 Comments   
Comment by Dianna Hohensee (Inactive) [ 09/Oct/19 ]

That sounds reasonable to me; the original solution was quite messy. Can you add a comment to SERVER-39591 saying that we closed this ticket without a fix and that we recommend using the invariant you describe to handle this issue?

Comment by Evgeni Dobranov [ 08/Oct/19 ]

louis.williams and I talked about the invariant aimed for by SERVER-39591 and how it specifically relates to repair.

Our first thought was simply to remove this call to waitUntilDurable(), but the program could theoretically fail at exactly that point. If we removed the call, we could end up in a state where the "repair incomplete" file has been removed but the journal was never flushed. That would make it look like repair succeeded when it did not, and the replica set node would have no indication that repair ran and might attempt to rejoin the set. So removing the waitUntilDurable() call is ruled out.

We think it would be best and simplest to add a special case for repair when adding the invariant for SERVER-39591, rather than messing with the global lock: something like invariant(!opCtx->lockState()->isLocked() || storageGlobalParams.repair). The reasoning, which also addresses the concerns in the original ticket (see the sketch after this list):

  1. The lock might cause other operations to block, but a server in repair mode typically does not prioritize performance.
  2. A server in repair mode does not accept any incoming connections, so a ticketing deadlock would not be possible under repair's circumstances.
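
As a sketch, the special case would sit inside the SERVER-39591 invariant; the placement and surrounding code are assumptions, only the condition itself comes from this discussion:

// Hypothetical placement, e.g. inside RecoveryUnit::waitUntilDurable()
// once SERVER-39591 adds its invariant. Repair mode is exempt from the
// no-locks-held rule because concurrency is not a concern there.
invariant(!opCtx->lockState()->isLocked() || storageGlobalParams.repair);
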
Comment by Dianna Hohensee (Inactive) [ 01/Aug/19 ]

I looked into this a bit. A question that needs answering is whether we must ensure that the StorageRepairObserver::onRepairDone write reaches disk even if we throw or error in the code that follows that function call. I do not think it is safe to assume that none of the following code will throw: even if it turns out to be safe now, that is not future proof.

That answer informs whether we can add a scope around the global lock acquisition and then call waitUntilDurable() after everything completes successfully:

auto function() {
    ReturnType returnValue;
    {
        GlobalLock lk(...);
        // ... existing work under the lock, setting returnValue ...
    }  // lock released at end of scope
    waitUntilDurable();
    return returnValue;
}

Alternatively, if it is safe to call waitUntilDurable() after any error in repairDatabasesAndCheckVersion, we could avoid the extra lock scope by doing:

try {
    GlobalLock lk(...);
    // ... existing work under the lock ...
} catch (...) {
    waitUntilDurable();
    throw;  // rethrow the original exception
}

Otherwise, we are left with making the GlobalLock a boost::optional<GlobalLock> and releasing and reacquiring it around the calls to onRepairDone() and waitUntilDurable(), with waitUntilDurable() elevated to the repairDatabasesAndCheckVersion function level. A sketch of that option follows.
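
A sketch of that last option, with names and constructor arguments assumed for illustration:

// Sketch only: release the global lock around the durable wait, which is
// elevated to the repairDatabasesAndCheckVersion level.
boost::optional<Lock::GlobalLock> globalLock;
globalLock.emplace(opCtx, MODE_X);
// ... repair work, ending with StorageRepairObserver::onRepairDone() ...
globalLock.reset();                         // release the global lock
opCtx->recoveryUnit()->waitUntilDurable();  // long I/O, no lock held
globalLock.emplace(opCtx, MODE_X);          // reacquire for remaining work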
