[SERVER-68372] Ensure operations are not interrupted with `NotWritablePrimary` in POS tests
Created: 27/Jul/22  Updated: 05/Dec/22

Status: Backlog
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Amirsaman Memaripour
Assignee: Backlog - Service Architecture
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Service Arch
Operating System: ALL
Steps To Reproduce:
  • Add some delay (e.g., 5 seconds) before PrimaryOnlyService switches state from kRebuilding to kRunning (see here).
  • Build db_repl_test and run RecreateInstanceOnStepUp multiple times (e.g., 1000).
Sprint: Service Arch 2022-08-08, Service Arch 2022-08-22, Service Arch 2022-09-05, Service Arch 2022-09-19
Participants:
Linked BF Score: 5

 Description   

Other tests in primary_only_service_test.cpp may also be affected, but at least one test (RecreateInstanceOnStepUp) can fail because of a race between the thread that is completing stepUp for a PrimaryOnlyService (POS) instance and another thread that tries to create an opCtx on the service (see here):

try {
    auto opCtx = cc().makeOperationContext();
    ...
} catch (const DBException& e) {
    _documentWriteException.setError(e.toStatus());
    throw;
}

Operations are interrupted with NotWritablePrimary if they are created while the POS instance is still rebuilding, that is, before it switches from kRebuilding to kRunning. Constructing an AllowOpCtxWhenServiceRebuildingBlock before making the OperationContext avoids the race, but may not be desirable as the actual fix:

try {
    AllowOpCtxWhenServiceRebuildingBlock allowOpCtxBlock(Client::getCurrent());
    auto opCtx = cc().makeOperationContext();
    ...
} catch (const DBException& e) {
    _documentWriteException.setError(e.toStatus());
    throw;
}
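
As a rough illustration of the scoped opt-in idea behind the workaround, the following toy model shows an RAII guard that lets a client create operations while a service is rebuilding. Every name here (ToyClient, AllowWhileRebuildingGuard, makeToyOperation, and so on) is hypothetical and is not the MongoDB implementation. The model also simplifies interruption to an exception thrown at creation time, which is an assumption and not something this ticket confirms.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins; none of these are the real MongoDB types.
enum class ServiceState { kRebuilding, kRunning };

struct ToyClient {
    bool allowOpCtxWhileRebuilding = false;
};

struct NotWritablePrimaryError : std::runtime_error {
    NotWritablePrimaryError() : std::runtime_error("NotWritablePrimary") {}
};

// RAII guard: sets the opt-in flag for the lifetime of the scope and
// restores the previous value when the scope exits.
class AllowWhileRebuildingGuard {
public:
    explicit AllowWhileRebuildingGuard(ToyClient& client)
        : _client(client), _previous(client.allowOpCtxWhileRebuilding) {
        _client.allowOpCtxWhileRebuilding = true;
    }
    ~AllowWhileRebuildingGuard() {
        _client.allowOpCtxWhileRebuilding = _previous;
    }
    AllowWhileRebuildingGuard(const AllowWhileRebuildingGuard&) = delete;
    AllowWhileRebuildingGuard& operator=(const AllowWhileRebuildingGuard&) = delete;

private:
    ToyClient& _client;
    bool _previous;
};

// Models operation creation: rejected while the service is rebuilding
// unless the client has opted in through the guard.
void makeToyOperation(const ToyClient& client, ServiceState state) {
    if (state == ServiceState::kRebuilding && !client.allowOpCtxWhileRebuilding) {
        throw NotWritablePrimaryError();
    }
}

// Convenience wrapper that reports success instead of throwing.
bool tryMakeToyOperation(const ToyClient& client, ServiceState state) {
    try {
        makeToyOperation(client, state);
        return true;
    } catch (const NotWritablePrimaryError&) {
        return false;
    }
}

// Usage example: the opt-in only applies inside the guarded scope.
void guardUsageExample() {
    ToyClient client;
    assert(!tryMakeToyOperation(client, ServiceState::kRebuilding));
    {
        AllowWhileRebuildingGuard guard(client);
        assert(tryMakeToyOperation(client, ServiceState::kRebuilding));
    }
    assert(!tryMakeToyOperation(client, ServiceState::kRebuilding));
}
```

The guard restores the previous flag value instead of clearing it, so nested guards in this sketch compose without surprises.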

The fix for this ticket should ensure that these operations are not interrupted, either by strictly ordering stepUp before the construction of the opCtx, or by adopting the workaround shown above.
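
The strict-ordering option could look roughly like the sketch below, in which the thread that wants to start an operation first waits until the service has left the rebuilding state. This is a self-contained toy built on the standard library only. ToyService, finishRebuild, waitUntilRunning, and tryStartOperation are hypothetical names, not PrimaryOnlyService APIs, and the toy never transitions back to rebuilding, so it ignores a stepdown that happens between the wait and the operation's creation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of the service state machine described in the ticket.
enum class ToyState { kRebuilding, kRunning };

class ToyService {
public:
    // Called by the step-up thread once rebuilding has finished.
    void finishRebuild() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _state = ToyState::kRunning;
        }
        _cv.notify_all();
    }

    // Blocks the caller until the service reaches kRunning.
    void waitUntilRunning() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _state == ToyState::kRunning; });
    }

    // Models operation creation: returns false if the operation would be
    // interrupted because the service is still rebuilding.
    bool tryStartOperation() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _state == ToyState::kRunning;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    ToyState _state = ToyState::kRebuilding;
};

// Runs a step-up thread that finishes rebuilding after `rebuildDelay` and a
// worker thread that waits for kRunning before starting its operation.
// Returns whether the worker's operation started without interruption.
bool runOrderedScenario(std::chrono::milliseconds rebuildDelay) {
    ToyService service;
    bool started = false;

    std::thread stepUp([&] {
        std::this_thread::sleep_for(rebuildDelay);
        service.finishRebuild();
    });
    std::thread worker([&] {
        service.waitUntilRunning();  // ordering: step-up completes first
        started = service.tryStartOperation();
    });

    stepUp.join();
    worker.join();  // join orders the write to `started` before the read below
    return started;
}
```

With the wait in place, the injected delay from the reproduction steps only makes the worker start later; in this toy it can no longer observe the rebuilding state.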



 Comments   
Comment by Matt Diener (Inactive) [ 20/Sep/22 ]

Moving out of the sprint and attaching to the POS Improvements epic. This task requires some serious design and should be scoped to the project, as state management during stepdown/stepup is one of the biggest pain points facing POS today.

Generated at Thu Feb 08 06:10:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.