Core Server / SERVER-60775

PrimaryOnlyService won't wait for prior Instance on step up if InstanceID had completed on an interceding primary already (ABA problem)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 5.0.0, 5.1.0-rc0
    • Component/s: None
    • Service Arch
    • ALL
    • 1

      PrimaryOnlyService::onStepUp() has logic to wait until all Instances constructed in a previous term have finished executing before it constructs any new Instances in a higher term.

      // This ensures that all instances from previous term have joined.
      for (auto& instance : savedInstances) {

      The logic in PrimaryOnlyService::onStepUp() only applies to Instances which are still tracked in PrimaryOnlyService::_activeInstances. When the state document for the Instance is removed, the Instance is also removed from PrimaryOnlyService::_activeInstances. However, PrimaryOnlyServiceOpObserver::onDelete() also runs on secondaries as part of oplog application, so a node can stop tracking an Instance that it constructed in an earlier term and that has not finished executing locally.
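
The tracking gap described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's code: ToyInstance, ToyService, and stepUpWaitCountAfterSecondaryDelete() are hypothetical names, and a plain bool stands in for the future returned by run().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a PrimaryOnlyService Instance.
struct ToyInstance {
    explicit ToyInstance(long long t) : term(t) {}
    long long term;         // term in which the Instance was constructed
    bool finished = false;  // stands in for "the future returned by run() is ready"
};

// Hypothetical stand-in for the service and its _activeInstances map.
struct ToyService {
    std::map<std::string, std::shared_ptr<ToyInstance>> activeInstances;

    void construct(const std::string& id, long long term) {
        activeInstances[id] = std::make_shared<ToyInstance>(term);
    }

    // Models the op observer's delete hook: it runs on every node that
    // applies the delete, including secondaries.
    void onDelete(const std::string& id) {
        activeInstances.erase(id);
    }

    // Models the step-up logic: only tracked Instances can be waited on.
    std::vector<std::string> instancesToJoinOnStepUp() const {
        std::vector<std::string> ids;
        for (const auto& entry : activeInstances) {
            if (!entry.second->finished) {
                ids.push_back(entry.first);
            }
        }
        return ids;
    }
};

// Replays the sequence on a node that was primary in term 5 and later
// applies the delete as a secondary. Returns how many Instances a
// subsequent step-up would wait for.
std::size_t stepUpWaitCountAfterSecondaryDelete() {
    ToyService node;
    node.construct("X", 5);  // constructed while primary in term 5
    assert(node.instancesToJoinOnStepUp().size() == 1);
    node.onDelete("X");      // delete applied from the oplog as a secondary
    // "X" never finished locally, but it is no longer tracked.
    return node.instancesToJoinOnStepUp().size();
}
```

In this model the function returns 0: the unfinished Instance is invisible to the step-up logic once the delete has been applied.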

      This leads to a situation where an Instance can be constructed in term 7 even though an (untracked) Instance with a different ID from term 5 has not yet had the future returned by run() become ready. I suspect the solution here is to have PrimaryOnlyServiceOpObserver check whether the current node is primary when handling the delete/drop, and only then remove the ActiveInstance from the map.
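
A minimal sketch of the suspected fix, again as a toy model rather than the actual server change. GuardedRegistry, its isPrimary flag, and stepUpWaitCountWithGuard() are hypothetical; how the real observer would learn the node's replication state is an open question for the investigation.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical registry: Instance ID -> whether its run() future is ready.
struct GuardedRegistry {
    std::map<std::string, bool> activeInstances;
    bool isPrimary = false;  // assumed to mirror the node's replication state

    void track(const std::string& id) {
        activeInstances[id] = false;
    }

    // Suggested guard: only a primary stops tracking on delete/drop.
    // A secondary applying the same oplog entry leaves the map untouched.
    void onDelete(const std::string& id) {
        if (!isPrimary) {
            return;
        }
        activeInstances.erase(id);
    }

    // Number of tracked Instances a step-up would still have to wait for.
    std::size_t unfinishedTracked() const {
        std::size_t count = 0;
        for (const auto& entry : activeInstances) {
            if (!entry.second) {
                ++count;
            }
        }
        return count;
    }
};

// Same sequence as in the report, with the guard in place.
std::size_t stepUpWaitCountWithGuard() {
    GuardedRegistry node;
    node.isPrimary = true;
    node.track("X");         // constructed while primary in term 5
    node.isPrimary = false;  // stepped down; another node is primary
    node.onDelete("X");      // delete applied as a secondary: ignored
    node.isPrimary = true;   // steps up again in term 7
    return node.unfinishedTracked();
}
```

With the guard, the model still tracks the term-5 Instance at step-up, so the function returns 1 and the step-up logic has something to wait on. The sketch does not address when such an entry would eventually be cleaned up, which the real change would need to decide.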


      Acceptance criteria: Investigate the root cause and propose possible solutions for triage. 

            backlog-server-servicearch Backlog - Service Architecture
            max.hirschhorn@mongodb.com Max Hirschhorn