  1. Core Server
  2. SERVER-97558

Look for ways to improve the 'getCurrentOps' codepath to protect against incorrect results caused by race conditions

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: None
    • Query Integration

      In BF-35528, the $currentOp aggregation stage returned incorrect query results. Specifically, an 'OplogWriter' operation was reported in the results as inactive, even though the query asked for only active operations.

       

      naama.bareket@mongodb.com and I spent multiple days determining the root cause of this problem, which turned out to be a race condition: the code first checks the active status of the operation, and only afterwards converts the operation into the BSON representation that is returned to the user. Because the operation and the $currentOp code run concurrently, the operation's active status can change between the check and the conversion to BSON. Even though a lock is held on the enclosing "Client" object, the underlying objects that "Client" points to are not locked.
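
      The check-then-convert ordering described above can be illustrated with a small sketch. This is not the server code: `Operation`, `report_check_then_serialize`, and the `between` hook are hypothetical names, and the hook only stands in for the concurrent thread so that the interleaving is reproducible.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Operation:
    """Hypothetical stand-in for a live operation that another thread mutates."""
    desc: str
    active: bool


def report_check_then_serialize(
    op: Operation, between: Optional[Callable[[], None]] = None
) -> Optional[dict]:
    """Racy ordering: filter on the live state, then serialize the live state."""
    if not op.active:          # step 1: check the active status
        return None
    if between is not None:
        between()              # simulated concurrent change in the window
    # step 2: serialize; this re-reads the live field, which may have changed
    return {"desc": op.desc, "active": op.active}


op = Operation("OplogWriter", active=True)
# The operation goes inactive after the check but before serialization.
result = report_check_then_serialize(op, between=lambda: setattr(op, "active", False))
print(result)  # {'desc': 'OplogWriter', 'active': False}
```

      The operation passes the "active only" filter and is still reported as inactive, which is the symptom seen in BF-35528.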

       

      To address BF-35528 specifically and within a reasonable time bound, we came up with this PR, which converts the operation to its BSON representation before checking the active status, so that there is no window in which the active status can change between the check and its inclusion in the results. This fix is not comprehensive: it covers only the active status check, not every possible error of this type, but that is better than nothing. While implementing the PR, I found no quick or easy way to generalize the strategy of first converting the client/operation to BSON and then checking it, because many of the checks are passed into helper functions that require the live Client object. Furthermore, there are multiple virtual functions with different implementations, and the current op code path is complex overall. In short, the strategy used for BF-35528 is likely generalizable, but the specific implementation will require more time and thought.
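
      A minimal sketch of the convert-then-check ordering follows, assuming a simplified model rather than the actual server classes. `Operation`, `report_serialize_then_check`, and the `between` hook are hypothetical names; the hook stands in for a concurrent change to the operation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Operation:
    """Hypothetical stand-in for a live operation that another thread mutates."""
    desc: str
    active: bool


def report_serialize_then_check(
    op: Operation, between: Optional[Callable[[], None]] = None
) -> Optional[dict]:
    """Snapshot first, then filter on the snapshot instead of the live object."""
    # step 1: serialize once; the live field is read a single time here
    snapshot = {"desc": op.desc, "active": op.active}
    if between is not None:
        between()              # a concurrent change can no longer affect the result
    # step 2: check the value that will actually be returned
    if not snapshot["active"]:
        return None
    return snapshot


op = Operation("OplogWriter", active=True)
result = report_serialize_then_check(op, between=lambda: setattr(op, "active", False))
print(result)  # {'desc': 'OplogWriter', 'active': True}
```

      The returned document may be slightly stale, but the filter and the output always agree because both use the same snapshot. Checks that need the live Client object cannot be moved onto the snapshot this simply, which is the difficulty in generalizing the approach.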

       

      This ticket is fairly open-ended. It could turn into a series of more concrete tickets, an overall refactor of the getCurrentOps codepath, or something in between, based on the best judgment of the assignee and the amount of time they can dedicate to it.

            Assignee: Unassigned
            Reporter: joseph.shalabi@mongodb.com Joe Shalabi
            Votes: 0
            Watchers: 3