Core Server / SERVER-54350

Investigate potential race conditions in SBE oplog plans

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: Querying
    • Sprint: Query Execution 2021-03-22, Query Execution 2021-05-03, Query Execution 2021-05-17, Query Execution 2021-05-31, Query Execution 2021-06-14

      In SBE, we generate special optimized plans in the case where we are running a collection scan on the oplog. A number of these plans involve either resolving a recordId in advance, or performing two separate scans within the same execution plan: one scan which checks some condition or produces some output, and a second "real" scan which uses that output as a parameter of its own execution. During development of SERVER-50580, we realised that these optimized oplog plans may behave incorrectly if entries fall off the oplog during the latency window between the time we run the first part of the plan and the time we begin executing the "real" scan.

      Here, for instance, we resolve the recordId of the entry to which we want to skip before constructing the SBE plan, and then inject it into the plan as a constant value. But in the time between the point at which we resolve the seekRecordId and the point at which we actually begin executing the scan, that record may have fallen off the oplog. If this happens, we will incorrectly EOF the scan immediately.
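A minimal sketch of that race, using a toy capped log in place of the real oplog (all names here — `ToyOplog`, `seekTo`, `seekRecordIdRace` — are illustrative, not the SBE ScanStage API):

```cpp
#include <deque>
#include <optional>

// Toy stand-in for the oplog: a capped deque of recordIds.
struct ToyOplog {
    std::deque<long> records;

    // Capped-collection rollover: the oldest entry falls off.
    void rollover() {
        if (!records.empty()) records.pop_front();
    }

    // A scan that seeks to a pre-resolved recordId: if that record has
    // already fallen off, the scan reports EOF (nullopt) immediately.
    std::optional<long> seekTo(long seekRecordId) const {
        for (long r : records)
            if (r == seekRecordId) return r;
        return std::nullopt;  // spurious EOF
    }
};

// Reproduces the race: resolve the seekRecordId, then let entries fall off
// before the scan runs. Returns true iff the scan hits the spurious EOF.
inline bool seekRecordIdRace() {
    ToyOplog oplog{{10, 11, 12, 13}};
    long seekRecordId = 11;  // resolved before the plan is built
    oplog.rollover();        // 10 falls off during the latency window
    oplog.rollover();        // 11 falls off -- the seek target is gone
    return !oplog.seekTo(seekRecordId).has_value();
}
```

The scan still holds records 12 and 13, but because the constant seek target is gone it reports EOF rather than resuming at the next surviving entry.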

      Similarly, here we create an NLJ (nested loop join) whose outer branch scans the oplog until it reaches the first entry that matches our filter, and then passes that recordId to the inner branch, which continues the scan from that point without applying the filter to any subsequent entries. But if that recordId falls off the oplog between the time the first scan completes and the time the second scan begins (including, but potentially not limited to, the case where we yield at the wrong moment), we will again hit a spurious EOF. The same is true of the ASSERT_MIN_TS UnionStage plan proposed in SERVER-50580.
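The double-scan handoff can be sketched the same way. This is a toy model, not the SBE NLJ/ScanStage API: the "outer branch" finds the first matching recordId, the "inner branch" resumes from it with no filter, and a rollover in between loses the handoff point:

```cpp
#include <deque>
#include <functional>
#include <optional>

// Toy stand-in for the oplog (illustrative names only).
struct ToyOplog {
    std::deque<long> records;

    void rollover() {  // oldest entry falls off the capped collection
        if (!records.empty()) records.pop_front();
    }

    // Outer branch of the NLJ: scan until the filter matches, hand off that recordId.
    std::optional<long> firstMatching(const std::function<bool(long)>& filter) const {
        for (long r : records)
            if (filter(r)) return r;
        return std::nullopt;
    }

    // Inner branch: resume the scan from the handed-off recordId, with no
    // filter. If the resume point has fallen off, the scan EOFs immediately.
    std::deque<long> resumeFrom(long recordId) const {
        std::deque<long> out;
        bool found = false;
        for (long r : records) {
            if (r == recordId) found = true;
            if (found) out.push_back(r);
        }
        return out;  // empty => spurious EOF
    }
};

// Returns true iff the inner branch EOFs even though later entries still match.
inline bool nljHandoffRace() {
    ToyOplog oplog{{10, 11, 12, 13}};
    auto filter = [](long r) { return r >= 11; };
    long handoff = *oplog.firstMatching(filter);  // outer branch hands off 11
    oplog.rollover();  // 10 falls off during the inter-scan latency window
    oplog.rollover();  // 11 falls off -- the handoff recordId is gone
    return oplog.resumeFrom(handoff).empty();  // EOF despite 12, 13 matching
}
```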

      We do not believe that tailable cursors in general are susceptible to this problem, despite using a two-scan plan, because the first scan always scans to EOF before passing the last observed recordId to the second scan. The user will have to issue a getMore before the second branch is executed; if the recordId has fallen off the capped collection by then, the plan will throw CappedPositionLost. Tailable awaitData cursors may be more susceptible, since they will continue to attempt to pull from the second branch after the first branch EOFs.

      The way to resolve the "double scan" scenario would be to incorporate the filtering performed by the first scan directly into the second, so that it is executed inline with the "real" scan; this means that there would be no inter-scan latency window during which entries could unexpectedly fall off the oplog. This solution would require a way to execute the filter only once, which could be implemented either by introducing a SegmentStage to generate a sequence of incrementing integer values, or by adding an "executeOnce" mode to the existing FilterStage. The issue caused by resolving the seekRecordId before building the plan would require some further thought, possibly pushing down the logic to obtain the recordId into the ScanStage as is done in the classic engine.
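The inline-filter idea can be sketched as a single pass in the same toy model. The "executeOnce" behaviour here is a hypothetical illustration of the proposal, not the existing FilterStage API: the filter is evaluated only until it first passes, then switched off for the rest of the scan:

```cpp
#include <deque>
#include <functional>

// Toy stand-in for the oplog (illustrative names only).
struct ToyOplog {
    std::deque<long> records;
    void rollover() { if (!records.empty()) records.pop_front(); }
};

// Sketch of the proposed fix: fold the outer branch's filter into the single
// "real" scan. Once the filter passes, it is never evaluated again (the
// hypothetical "executeOnce" mode), matching the NLJ plan's semantics.
inline std::deque<long> scanWithExecuteOnceFilter(
        const ToyOplog& oplog, const std::function<bool(long)>& filter) {
    std::deque<long> out;
    bool filterSatisfied = false;
    for (long r : oplog.records) {
        if (!filterSatisfied && !filter(r))
            continue;          // still skipping the unmatched prefix
        filterSatisfied = true;
        out.push_back(r);      // filter disabled from here on
    }
    return out;
}

// Rollovers before the scan no longer cause a spurious EOF: there is no
// handed-off recordId to lose, so the result degrades gracefully.
inline std::deque<long> singleScanSurvivesRollover() {
    ToyOplog oplog{{10, 11, 12, 13}};
    oplog.rollover();  // 10 falls off
    oplog.rollover();  // 11 falls off
    return scanWithExecuteOnceFilter(oplog, [](long r) { return r >= 11; });
}
```

With the same rollovers that broke the two-scan plan, the single scan still returns the surviving matches 12 and 13.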

      Before committing to this work, however, we should confirm that the scenarios above can actually arise. This may involve adding failpoints into SBE plans to cause them to freeze or yield at the appropriate moment, forcing the oplog to roll over, then allowing the plan to continue.
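The failpoint idea can be modelled as a hook fired between the two phases of the plan; in the toy model below (illustrative names, not the server's failpoint framework), an inert hook leaves the plan correct, while a hook that forces a rollover reproduces the race deterministically:

```cpp
#include <deque>
#include <functional>
#include <optional>

// Toy stand-in for the oplog (illustrative names only).
struct ToyOplog {
    std::deque<long> records;
    void rollover() { if (!records.empty()) records.pop_front(); }
    std::optional<long> seekTo(long recordId) const {
        for (long r : records)
            if (r == recordId) return r;
        return std::nullopt;
    }
};

// Runs phase 1 (resolve the resume point), fires the failpoint-style hook,
// then runs phase 2 (the real scan). Returns nullopt on the spurious EOF.
inline std::optional<long> runTwoPhasePlan(
        ToyOplog& oplog, const std::function<void()>& failpoint) {
    long seekRecordId = oplog.records.front();  // phase 1
    failpoint();                                // freeze/yield at the worst moment
    return oplog.seekTo(seekRecordId);          // phase 2
}

// With a hook that rolls the oplog over mid-plan, the race reproduces reliably.
inline bool raceReproducedViaFailpoint() {
    ToyOplog oplog{{10, 11, 12}};
    return !runTwoPhasePlan(oplog, [&oplog] { oplog.rollover(); }).has_value();
}
```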

            Drew Paroski (andrew.paroski@mongodb.com)
            Bernard Gorman (bernard.gorman@mongodb.com)
            Votes: 0
            Watchers: 8