Core Server / SERVER-79810

make JournalFlusher::waitForJournalFlush() interruptible when waiting for write concern

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Team: Storage Execution
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport: v7.0

      The deadlock is outlined here but I've copy and pasted it for ease:

      1. A thread running setAllowMigrations (which has checked out a session) waits for its metadata changes to be majority committed.
      2. A stepdown thread takes the RSTL lock and tries to check out the session held by thread 1 in order to kill it.
      3. The JournalFlusher thread tries to take the RSTL lock held by thread 2.

      When stepping down, we want caller threads to be able to know that the journal flusher was interrupted. Otherwise, we can get into a deadlock.

      Areas that may need to be addressed:

      1. The journal flusher retries after every interruption, even ErrorCodes::InterruptedDueToReplStateChange. We want to be able to setError when the journal flusher is interrupted during stepdown.
      2. Even if we setError, the flusher still gets stuck in an infinite while loop retrying the journal flush.
      3. The caller we're concerned about for the stepdown deadlock is writeConcern. We may want to waitForJournalFlusher without retrying and could introduce a new method for this. Alternatively, we could pass the writeConcern waiter's opCtx to the JournalFlusher so the wait can be interrupted.

      We should add a test for this deadlock so we can confirm the fix and catch the problem early if it regresses.
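A minimal sketch of the opCtx-passing idea, under stated assumptions: OpCtx here is a hypothetical stand-in for the server's OperationContext, and Status a two-value enum rather than the real class. The waiter checks the caller's interruption flag instead of retrying forever, so stepdown's interruption surfaces to the write concern caller as an error:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class Status { OK, InterruptedDueToReplStateChange };

struct OpCtx {  // hypothetical stand-in for OperationContext
    std::atomic<bool> interrupted{false};
};

class JournalFlusher {
public:
    // Returns instead of looping forever: an interruption observed while the
    // caller waits is surfaced as an error the caller can act on.
    Status waitForJournalFlush(OpCtx& opCtx) {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_flushed) {
            if (opCtx.interrupted.load())
                return Status::InterruptedDueToReplStateChange;
            _cv.wait_for(lk, std::chrono::milliseconds(10));
        }
        return Status::OK;
    }

    // Called by the flusher thread once a flush completes.
    void markFlushed() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _flushed = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _flushed = false;
};
```

With this shape, the stepdown path only needs to set the waiter's interruption flag; the wait unwinds on its own instead of re-entering the retry loop.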

            Assignee:
            Benety Goh (benety.goh@mongodb.com)
            Reporter:
            Shin Yee Tan (shinyee.tan@mongodb.com)
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: