Core Server / SERVER-43399

Add logging to DeleteOpIsIdBased test to debug rare test hang in waitForAllEarlierOplogWritesToBeVisible

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 4.3.1
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Repl 2019-10-07, Repl 2019-10-21
    • 13

      A hang was observed in the DeleteOpIsIdBased unittest in repltests.cpp. The test performs several deletes (which create delete oplog entries) and immediately queries the oplog, triggering a call to waitForAllEarlierOplogWritesToBeVisible. The stack trace is approximately:

      Thread 1: "testsuite" (Thread 0x7fdf3e0f7ac0 (LWP 67799))
       .
      #10 0x000055f6e9f94225 in mongo::Interruptible::waitForConditionOrInterrupt
      #11 mongo::WiredTigerOplogManager::waitForAllEarlierOplogWritesToBeVisible
       .
      #16 0x000055f6eb3e54fb in mongo::(anonymous namespace)::FindCmd::Invocation::run
       .
      #28 0x000055f6e97c822d in ReplTests::Base::applyAllOperations
      #29 0x000055f6e9824b59 in ReplTests::DeleteOpIsIdBased::run 

      This hang was observed approximately once in Evergreen. It seems likely to be a race between the WTOplogJournalThread and the main thread, where the main thread expects the WTOplogJournalThread to call _setOplogReadTimestamp, but that thread either already has or never does. As lingzhi.deng showed me, it may be because waitForAllEarlierOplogWritesToBeVisible increments _opsWaitingForVisibility to tell the journal thread that someone is waiting for it, but the journal thread checks a different member, _opsWaitingForJournal, to determine if there are any waiters.
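
      The suspected mismatch can be illustrated with a toy model. This is only a sketch of the hypothesis above, not the WiredTigerOplogManager source: the struct, its method names, and the "both counters" check are hypothetical stand-ins, and only the two counter names are borrowed from the description.

```cpp
// Toy model of the suspected counter mismatch. NOT MongoDB source code:
// ToyOplogManager and its methods are hypothetical, for illustration only.
#include <cassert>
#include <cstdint>

struct ToyOplogManager {
    // Bumped by a reader that waits for earlier oplog writes to be visible.
    std::uint64_t opsWaitingForVisibility = 0;
    // Bumped by a writer that waits for its write to be journaled.
    std::uint64_t opsWaitingForJournal = 0;

    // Reader side: announce that someone is waiting for visibility.
    void registerVisibilityWaiter() { ++opsWaitingForVisibility; }

    // Hypothesized check: the background thread looks at only one counter,
    // so a visibility waiter is invisible to it.
    bool journalThreadSeesWaiterSuspected() const {
        return opsWaitingForJournal > 0;
    }

    // Contrast case (an assumption, not the actual fix): a check that
    // consults both counters would notice the visibility waiter.
    bool journalThreadSeesWaiterBothCounters() const {
        return opsWaitingForJournal > 0 || opsWaitingForVisibility > 0;
    }
};
```

      In this model, after a single call to registerVisibilityWaiter(), journalThreadSeesWaiterSuspected() returns false while journalThreadSeesWaiterBothCounters() returns true. If the real thread behaves like the first check, a waiter blocked in waitForConditionOrInterrupt could go unnoticed, which would be consistent with the stack trace above.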

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: