  1. Core Server
  2. SERVER-50955

oplog_rollover.js pauses the OplogCapMaintainerThread until truncation is needed


    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4, v4.2
    • Sprint:
      Execution Team 2020-10-05
    • Linked BF Score:
      14

      Description

      A build failure in this test occurred when oplog truncation ran concurrently with the insertion of a record that should have caused the oplog to roll over.

      Looking at the insertions the test performed, we have the following oplog entries (some fields are not shown because the test uses a projection):

      [ 
       { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009744, 3), "t" : NumberLong(1) },
       { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009745, 1), "t" : NumberLong(1) }, 
       { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009747, 1), "t" : NumberLong(1) } 
      ]
      

      The oplog truncation thread was truncating the oplog between RecordIds 0 and 6863399589169332000.
      The three entries above have RecordIds 6863399589169332227, 6863399593464299521 and 6863399602054234113, respectively.
      All of these RecordIds are higher than the upper bound of the range being truncated, so none of the entries were truncated, even though the first oplog entry was expected to be.
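
      The comparison can be checked with a short sketch. It assumes the oplog RecordId packs the Timestamp's seconds into the high 32 bits and its increment into the low 32 bits, which reproduces the three values quoted above; the function and variable names are illustrative, not server code.

```python
def oplog_record_id(secs: int, inc: int) -> int:
    """Pack Timestamp(secs, inc) into one integer: seconds high, increment low.

    Assumption: this layout is inferred from the RecordIds quoted in the ticket.
    """
    return (secs << 32) | inc


# (seconds, increment) pairs taken from the "ts" fields of the three oplog entries.
entries = [(1598009744, 3), (1598009745, 1), (1598009747, 1)]
record_ids = [oplog_record_id(secs, inc) for secs, inc in entries]

# Upper bound of the range the truncation thread was working on, as reported.
truncate_upper_bound = 6863399589169332000

for rid in record_ids:
    # Every entry lies above the bound, so none falls inside the truncated range.
    print(rid, rid > truncate_upper_bound)
```

      Under that assumption, even the oldest entry sits just above the truncation bound, which is consistent with all three entries surviving.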

      The third inserted record, which was supposed to roll over the oplog, failed to create a stone, so the test hung because the OplogCapMaintainerThread saw nothing to reclaim.

      The test waits until there are two oplog entries remaining, but there were always three oplog entries in this run.

      From my observation, based on when the oplog truncation thread was running and when the third record was inserted, I think we tried to create a new oplog stone while oplog truncation was running. The oplog truncation thread can hold a mutex for a short amount of time when calling either peekOldestStoneIfNeeded() or popOldestStone() in the reclaimOplog() function.

      During this time, the third record insertion tried to create a new oplog stone but returned early, most likely because the mutex was held by the oplog truncation thread.

      This is clearly a transient issue, since subsequent insertions would try again to create the oplog stone. However, this test performs no further insertions and expects the oplog stone to be created no matter what.
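
      The suspected interleaving can be illustrated with a minimal sketch. The OplogStones class, its fields and the byte threshold are hypothetical stand-ins rather than the server's implementation; the only behaviour modelled is the one described above, where stone creation makes a non-blocking attempt on the mutex and returns early if the truncation thread holds it.

```python
import threading


class OplogStones:
    """Hypothetical stand-in for stone bookkeeping; not the server's real class."""

    def __init__(self, min_bytes_per_stone: int) -> None:
        self._mutex = threading.Lock()
        self._min_bytes = min_bytes_per_stone
        self._pending_bytes = 0
        self.stones = []  # RecordIds that close each stone

    def truncation_lock(self) -> threading.Lock:
        # The truncation thread would hold this while peeking at or popping a stone.
        return self._mutex

    def on_insert(self, record_id: int, size: int) -> None:
        self._pending_bytes += size
        if self._pending_bytes < self._min_bytes:
            return
        # Non-blocking attempt: if the mutex is held, give up and rely on a
        # later insert to retry the stone creation.
        if not self._mutex.acquire(blocking=False):
            return
        try:
            self.stones.append(record_id)
            self._pending_bytes = 0
        finally:
            self._mutex.release()


stones = OplogStones(min_bytes_per_stone=100)

# Simulate the truncation thread holding the mutex during the third insert.
with stones.truncation_lock():
    stones.on_insert(record_id=3, size=150)
skipped = list(stones.stones)  # no stone was created

# Any later insert finds the mutex free and creates the stone.
stones.on_insert(record_id=4, size=10)
created = list(stones.stones)
print(skipped, created)
```

      In this sketch the missed stone is recovered only by the extra insert, which is exactly the step the test never performs.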

            People

            Assignee:
            gregory.wlodarek Gregory Wlodarek
            Reporter:
            gregory.wlodarek Gregory Wlodarek