  1. Core Server
  2. SERVER-44675

server_status_metrics.js fails due to racy repl.buffer.count metric in serverStatus

    • Fully Compatible
    • ALL
    • v4.2, v4.0, v3.6
    • Repl 2019-11-18, Repl 2019-12-02
    • 8

      The test fails (very rarely) due to a race in how the repl.buffer.count metric is calculated. There's a period when the rsBackgroundSync thread has added oplog entries to the buffer but hasn't yet incremented repl.buffer.count. During this period, the ReplBatcher thread can clear the buffer and decrement repl.buffer.count. Since the count can be decremented before it's incremented, it can be briefly negative. The server_status_metrics.js test doesn't expect this race.

      First, the test inserts 1000 docs with w: 2. The secondary's oplog buffer fills and empties: the metric is incremented by 1000 and then decremented by 1000. The test calls serverStatus on the secondary and checks that repl.buffer.count >= 0. It is in fact 0, so the assertion passes.

      Next, the test updates all 1000 docs with w: 2. Events can then proceed in this order:

      1. The rsBackgroundSync thread in BackgroundSync::_enqueueDocuments buffers 1000 oplog entries; bufferCountGauge is still 0.
      2. The ReplBatcher thread in SyncTail::tryPopAndWaitForMore calls bufferCountGauge.decrement(1) a thousand times; the gauge is now -1000.
      3. The test calls serverStatus; repl.buffer.count is -1000, so the test fails.
      4. The rsBackgroundSync thread in BackgroundSync::_enqueueDocuments calls bufferCountGauge.increment(1000), returning the gauge to 0.
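
The interleaving above can be replayed deterministically in a few lines. The following is a minimal sketch in Python, used only for illustration (the server itself is C++). The `Gauge` class and `replay_interleaving` function are hypothetical stand-ins, not server code; they assume only what the steps state: the producer buffers entries before incrementing the gauge, and the consumer decrements once per entry it pops.

```python
from collections import deque


class Gauge:
    """Hypothetical stand-in for bufferCountGauge (not the server's class)."""

    def __init__(self):
        self.value = 0

    def increment(self, n=1):
        self.value += n

    def decrement(self, n=1):
        self.value -= n


def replay_interleaving(num_entries=1000):
    """Replay steps 1-4 in order; return the gauge value seen after each step."""
    buffer = deque()
    gauge = Gauge()
    observed = []

    # Step 1: the producer (rsBackgroundSync) buffers the entries
    # but has not yet updated the gauge.
    buffer.extend(range(num_entries))
    observed.append(gauge.value)

    # Step 2: the consumer (ReplBatcher) drains the buffer,
    # decrementing the gauge once per popped entry.
    while buffer:
        buffer.popleft()
        gauge.decrement(1)
    observed.append(gauge.value)

    # Step 3: a serverStatus-style read happens inside the window.
    observed.append(gauge.value)

    # Step 4: the producer finally applies its increment.
    gauge.increment(num_entries)
    observed.append(gauge.value)

    return observed


if __name__ == "__main__":
    print(replay_interleaving())  # [0, -1000, -1000, 0]
```

The third value is what the test would read: the gauge is negative only between steps 2 and 4, which is consistent with the failure being very rare.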

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jesse@mongodb.com A. Jesse Jiryu Davis
