fsyncLock leaves durableOpTime stuck behind lastWritten, hanging snapshot/majority reads

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • ALL

      Deterministic single-node reproducer attached as
      fsync_lock_durable_optime_stuck.js. The test:

      1. Pauses JournalFlusher via the pauseJournalFlusherThread failpoint to keep
        the danger window open indefinitely.
      2. Issues a single w:1, j:false insert. The WUOW commits in memory, the oplog
        entry is applied, but the journal is not fsynced (because JournalFlusher is
        paused).
      3. Calls db.fsyncLock().
      4. Surfaces the wedged optimes via replSetGetStatus.
      5. Issues a snapshot read with afterClusterTime at the inserted entry's
        operationTime. The read hangs forever.

      The test is currently expected to fail (timeout) on master.

      Sample output of the wedged state:

        writtenOpTime:       Timestamp(1778182401, 2)
        appliedOpTime:       Timestamp(1778182401, 2)
        durableOpTime:       Timestamp(1778182399, 1)   // stuck
        lastCommittedOpTime: Timestamp(1778182399, 1)   // stuck
        

      resmoke output:

        fsync_lock_durable_optime_stuck.js ran in 25.29 seconds: timeout.
        The following tests failed (with exit code):
            jstests/noPassthrough/fsync_lock_durable_optime_stuck.js (-9 SIGKILL)
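
The wedged state in the sample output above can be checked mechanically. The following is a minimal sketch, not taken from the attached reproducer: the helper names (compareTs, isDurableStuck, readWouldWait) are hypothetical, and plain {t, i} objects stand in for the shell's Timestamp values so the snippet runs outside a mongod. It assumes, as the description states, that a snapshot or majority read with afterClusterTime waits until lastCommittedOpTime reaches that time.

```javascript
// Plain {t, i} objects stand in for Timestamp(t, i) (assumption for illustration).
function compareTs(a, b) {
  if (a.t !== b.t) return a.t < b.t ? -1 : 1;
  if (a.i !== b.i) return a.i < b.i ? -1 : 1;
  return 0;
}

// Hypothetical helper: true when applied == written, durable == lastCommitted,
// and the durable/committed pair lags behind the applied/written pair.
function isDurableStuck(optimes) {
  return compareTs(optimes.appliedOpTime, optimes.writtenOpTime) === 0 &&
         compareTs(optimes.durableOpTime, optimes.lastCommittedOpTime) === 0 &&
         compareTs(optimes.durableOpTime, optimes.appliedOpTime) < 0;
}

// Hypothetical helper: a read at afterClusterTime has to wait while
// lastCommittedOpTime is still behind it.
function readWouldWait(optimes, afterClusterTime) {
  return compareTs(optimes.lastCommittedOpTime, afterClusterTime) < 0;
}

// Values copied from the sample output in this ticket.
const sample = {
  writtenOpTime:       {t: 1778182401, i: 2},
  appliedOpTime:       {t: 1778182401, i: 2},
  durableOpTime:       {t: 1778182399, i: 1},
  lastCommittedOpTime: {t: 1778182399, i: 1},
};

console.log(isDurableStuck(sample));
console.log(readWouldWait(sample, sample.appliedOpTime));
```

In a real test the same comparison would be applied to the timestamps reported by replSetGetStatus while the lock is held.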
        
    • Storage Execution 2026-05-11, Storage Execution 2026-05-25

      If an oplog entry is in the "applied but not yet durable" window on a primary at the moment fsyncLock acquires Global S, the in-memory durableOpTime stays stuck behind lastWritten for the entire lifetime of the lock. lastCommittedOpTime cannot advance past the stuck timestamp, so any snapshot or majority read with afterClusterTime at or past the in-flight entry hangs until fsyncUnlock is called. The bug affects any primary, regardless of the number of secondaries or the storage workload.

      Symptoms

      When the bug is triggered, the in-flight entry comes from whichever asynchronous primary-side oplog writer happens to commit a WUOW in the gap of a few hundred microseconds between its commit and the next periodic journal fsync. Externally visible effects on a hung primary:

      • $currentOp on system threads:
          JournalFlusher                       waitingForLock: true   locks: {Global: "w"}
          OplogCapMaintainerThread             waitingForLock: true   locks: {Global: "w"}
          ChangeStreamExpiredPreImagesRemover  waitingForLock: true   locks: {Global: "w"}
          fsyncLockWorker                      waitingForLock: false  locks: {Global: "R"}
          
      • appliedOpTime == writtenOpTime, both ahead of
        durableOpTime == lastCommittedOpTime.
      • User snapshot or majority reads with afterClusterTime past durableOpTime
        hang for the duration of the lock.
      • Operations that attempt to take Global in IX mode block (writes, including
        j:true and majority writes); NoopWriter bails out via its 1 ms timeout.

      The hang is a stable equilibrium: nothing on the primary can advance durableOpTime while Global S is held, so the state persists until fsyncUnlock is called.
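
The lock picture behind this equilibrium can be sketched with the standard multi-granularity compatibility rules, keyed by the letters $currentOp prints ("r" = IS, "w" = IX, "R" = S, "W" = X). This is an illustrative model only, not server code: the table and the helper names canGrant and blockedThreads are hypothetical, and the thread list is copied from the $currentOp output above.

```javascript
// For each held mode, the set of modes that can be granted concurrently
// (standard multi-granularity locking rules; illustrative model only).
const COMPATIBLE = {
  r: new Set(["r", "w", "R"]), // IS coexists with IS, IX, S
  w: new Set(["r", "w"]),      // IX coexists with IS, IX
  R: new Set(["r", "R"]),      // S coexists with IS, S
  W: new Set([]),              // X coexists with nothing
};

// Hypothetical helper: can `requested` be granted while `held` is held?
function canGrant(held, requested) {
  return COMPATIBLE[held].has(requested);
}

// Hypothetical helper: names of waiters that cannot proceed under `held`.
function blockedThreads(held, waiters) {
  return waiters.filter((w) => !canGrant(held, w.mode)).map((w) => w.name);
}

// Threads and requested modes as reported by $currentOp in this ticket.
const waiters = [
  {name: "JournalFlusher", mode: "w"},
  {name: "OplogCapMaintainerThread", mode: "w"},
  {name: "ChangeStreamExpiredPreImagesRemover", mode: "w"},
];

// fsyncLockWorker holds Global in "R" (S) mode.
console.log(blockedThreads("R", waiters));
```

Under this model every thread requesting "w" stays queued for as long as "R" is held, which matches JournalFlusher being unable to run and durableOpTime never advancing until fsyncUnlock.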

        1. fsync_lock_durable_optime_stuck.js
          6 kB
          Ernesto Rodriguez Reina

            Assignee:
            Ernesto Rodriguez Reina
            Reporter:
            Ernesto Rodriguez Reina