Core Server / SERVER-53466

Race between PrimaryOnlyService::stepDown and _rebuildInstances

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Service Arch
    • ALL
    • 29

      The PrimaryOnlyService stores a list of the operation contexts running on its associated Client threads. When the host running the service steps down and PrimaryOnlyService::onStepDown is called, each operation context in the list is killed.
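
      The bookkeeping described above can be sketched roughly as follows. This is a simplified model, not the server's code: FakeOpCtx, FakeService, registerOpCtx, unregisterOpCtx and markKilled are hypothetical names, and the single mutex guarding the set is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical stand-in for an operation context; it only tracks a kill flag.
struct FakeOpCtx {
    std::atomic<bool> killed{false};
    void markKilled() { killed.store(true); }
};

// Hypothetical stand-in for the service's list of live operation contexts.
class FakeService {
public:
    // Called when an opCtx is created on one of the service's Client threads.
    void registerOpCtx(FakeOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.insert(opCtx);
    }

    // Called when an opCtx is being destroyed.
    void unregisterOpCtx(FakeOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.erase(opCtx);
    }

    // Kills every opCtx that is still registered and returns how many were killed.
    std::size_t onStepDown() {
        std::lock_guard<std::mutex> lk(_mutex);
        for (FakeOpCtx* opCtx : _opCtxs) {
            opCtx->markKilled();
        }
        return _opCtxs.size();
    }

private:
    std::mutex _mutex;  // Assumed to guard _opCtxs in this model.
    std::unordered_set<FakeOpCtx*> _opCtxs;
};
```

      In this model, an opCtx that has already been unregistered is never touched by onStepDown, which is the property the rest of the ticket depends on.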

      However, if another thread is still managing the step-up process when stepDown is called, that thread may be in the middle of running PrimaryOnlyService::_rebuildInstances. That thread creates a new operation context associated with the POS, and the hooks in the PrimaryOnlyServiceClientObserver register it with the POS (i.e. insert it into its _opCtxs member). If this operation context goes out of scope while another thread is running onStepDown and trying to kill it, the two threads race: the killing thread reads the operation context's _baton member, while the thread in which the context has gone out of scope writes _baton in the chain of calls that starts with the opCtx's destructor.

      To fix this, we could consider running the PrimaryOnlyServiceClientObserver's cleanup hooks, which remove the opCtx from the POS's list, before allowing the opCtx destructor to modify any of its state (i.e. swap the call to opCtx->getBaton()->detach() with the line that invokes the hooks).
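
      A minimal sketch of the proposed ordering, under stated assumptions: FakeBaton, FakeOpCtx, FakeService and destroyOpCtx are hypothetical names, and the model assumes the kill path only reads the baton of registered contexts while holding the same mutex that unregistration takes. The ticket does not say how the real code synchronizes these paths.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the baton attached to an operation context.
struct FakeBaton {
    bool notified = false;
    bool detached = false;
};

// Hypothetical stand-in for an operation context that owns a baton.
struct FakeOpCtx {
    std::shared_ptr<FakeBaton> baton = std::make_shared<FakeBaton>();
};

class FakeService {
public:
    void registerOpCtx(FakeOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.insert(opCtx);
    }

    void unregisterOpCtx(FakeOpCtx* opCtx) {
        std::lock_guard<std::mutex> lk(_mutex);
        _opCtxs.erase(opCtx);
    }

    // Kill path: reads the baton of every registered opCtx under the mutex
    // and returns how many batons were notified.
    std::size_t onStepDown() {
        std::lock_guard<std::mutex> lk(_mutex);
        std::size_t notified = 0;
        for (FakeOpCtx* opCtx : _opCtxs) {
            if (opCtx->baton) {
                opCtx->baton->notified = true;
                ++notified;
            }
        }
        return notified;
    }

private:
    std::mutex _mutex;  // Assumed to guard _opCtxs in this model.
    std::unordered_set<FakeOpCtx*> _opCtxs;
};

// Teardown in the proposed order: run the cleanup hook (unregister) first,
// and only then modify the opCtx's own state by detaching the baton.
void destroyOpCtx(FakeService& svc, FakeOpCtx& opCtx, std::vector<std::string>& log) {
    svc.unregisterOpCtx(&opCtx);
    log.push_back("unregistered");

    opCtx.baton->detached = true;
    opCtx.baton.reset();
    log.push_back("baton detached");
}
```

      Because unregistration and the kill loop take the same mutex in this model, once unregisterOpCtx returns no killing thread can still be reading the baton, so the later write to it is no longer concurrent with a read.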

            Assignee: backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter: george.wangensteen@mongodb.com George Wangensteen
            Votes: 0
            Watchers: 4
