Core Server / SERVER-68372

Ensure operations are not interrupted with `NotWritablePrimary` in POS tests

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Internal Code
    • Labels:
      None
    • Service Arch
    • ALL
    • Steps to reproduce:
      • Add some delay (e.g., 5 seconds) before PrimaryOnlyService switches state from kRebuilding to kRunning (see here).
      • Build db_repl_test and run RecreateInstanceOnStepUp multiple times (e.g., 1000).
    • Service Arch 2022-08-08, Service Arch 2022-08-22, Service Arch 2022-09-05, Service Arch 2022-09-19
    • 5

      This may apply to other tests in primary_only_service_test.cpp, but at least one test, RecreateInstanceOnStepUp, can fail because of a race between the thread that is completing stepUp for a POS instance and another thread that attempts to create an opCtx on the service (see here):

      try {
          auto opCtx = cc().makeOperationContext();
          ...
      } catch (const DBException& e) {
          _documentWriteException.setError(e.toStatus());
          throw;
      }
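
To make the failure mode concrete, here is a minimal, single-threaded sketch of the two orderings. Every name in it (ToyService, NotWritablePrimaryError, runInstanceOnce) is hypothetical and is not the real PrimaryOnlyService API. It also assumes, as a simplification, that the interruption surfaces at the moment the operation context is created; in the real code it may surface later inside the elided body.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; not the real MongoDB types.
enum class ServiceState { kRebuilding, kRunning };

struct NotWritablePrimaryError : std::runtime_error {
    NotWritablePrimaryError() : std::runtime_error("NotWritablePrimary") {}
};

struct ToyService {
    ServiceState state = ServiceState::kRebuilding;

    // Models creating an opCtx on the service. Assumption: the operation is
    // rejected whenever the service has not yet reached kRunning.
    int makeOperationContext() const {
        if (state != ServiceState::kRunning) {
            throw NotWritablePrimaryError();
        }
        return 1;  // placeholder for a real operation context
    }
};

// Runs the instance body once and reports what a try/catch like the one
// in the test would record: "OK" or the error message.
std::string runInstanceOnce(const ToyService& service) {
    try {
        service.makeOperationContext();
        return "OK";
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

Calling runInstanceOnce while the state is kRebuilding yields "NotWritablePrimary"; calling it after the state is set to kRunning yields "OK". The test failure corresponds to the first ordering winning the race.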
      

      Operations are interrupted with NotWritablePrimary if they are created while the POS instance is still rebuilding. Adding the following line before creating the OperationContext avoids the race, but may not be desirable as the actual fix:

      try {
          AllowOpCtxWhenServiceRebuildingBlock allowOpCtxBlock(Client::getCurrent());
          auto opCtx = cc().makeOperationContext();
          ...
      } catch (const DBException& e) {
          _documentWriteException.setError(e.toStatus());
          throw;
      }
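
For readers unfamiliar with this kind of scoped override, the following is a rough sketch of how such a guard could behave. It is not the real AllowOpCtxWhenServiceRebuildingBlock; ToyClient, ToyAllowOpCtxBlock, and canMakeOperationContext are hypothetical names, and the per-client flag is an assumption about the mechanism.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the real MongoDB types.
enum class ServiceState { kRebuilding, kRunning };

struct ToyClient {
    // Assumption: a per-client flag decides whether operations may be
    // created while the service is still rebuilding.
    bool allowOpCtxWhileRebuilding = false;
};

// RAII guard: sets the flag for the lifetime of the guard and restores the
// previous value on scope exit, including when an exception unwinds the stack.
class ToyAllowOpCtxBlock {
public:
    explicit ToyAllowOpCtxBlock(ToyClient& client)
        : _client(client), _previous(client.allowOpCtxWhileRebuilding) {
        _client.allowOpCtxWhileRebuilding = true;
    }
    ~ToyAllowOpCtxBlock() {
        _client.allowOpCtxWhileRebuilding = _previous;
    }
    ToyAllowOpCtxBlock(const ToyAllowOpCtxBlock&) = delete;
    ToyAllowOpCtxBlock& operator=(const ToyAllowOpCtxBlock&) = delete;

private:
    ToyClient& _client;
    bool _previous;
};

// Returns true if an operation may be created for this client right now.
bool canMakeOperationContext(const ToyClient& client, ServiceState state) {
    return state == ServiceState::kRunning || client.allowOpCtxWhileRebuilding;
}
```

In this model the guard widens what is permitted only inside its scope, which is why it hides the race in the test without changing the ordering of stepUp and opCtx construction.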
      

      This ticket should propose a fix that ensures the operations are not interrupted, either by strictly ordering stepUp and the construction of the opCtx, or by using the AllowOpCtxWhenServiceRebuildingBlock approach shown above.
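
One possible shape for the strict-ordering option is a gate that the instance thread waits on before creating its opCtx. This is only a sketch under stated assumptions: ToyStateGate and observeStateAfterGate are hypothetical names, and nothing here reflects how PrimaryOnlyService actually tracks its state.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for illustration only; not the real MongoDB types.
enum class ServiceState { kRebuilding, kRunning };

class ToyStateGate {
public:
    // Called by the step-up path once rebuilding has finished.
    void markRunning() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = ServiceState::kRunning;
        }
        _cv.notify_all();
    }

    // Called by the instance thread before it creates an operation context.
    void waitUntilRunning() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _state == ServiceState::kRunning; });
    }

    ServiceState state() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    ServiceState _state = ServiceState::kRebuilding;
};

// Returns the state the worker observes at the point where it would create
// its opCtx. Because of the gate, this is kRunning regardless of scheduling.
ServiceState observeStateAfterGate() {
    ToyStateGate gate;
    ServiceState observed = ServiceState::kRebuilding;
    std::thread worker([&] {
        gate.waitUntilRunning();
        observed = gate.state();
    });
    gate.markRunning();
    worker.join();  // join() makes the worker's write to `observed` visible
    return observed;
}
```

A real fix would also need to handle step-down or shutdown while a thread is waiting, which this sketch omits; as written, waitUntilRunning blocks forever if markRunning is never called.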

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            amirsaman.memaripour@mongodb.com Amirsaman Memaripour
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: