[SERVER-60599] Count command on mongoq handles TenantMigrationAborted Created: 11/Oct/21  Updated: 01/Nov/21  Resolved: 01/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Esha Maharishi (Inactive)
Assignee: Mathis Bessa
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Server Serverless 2021-10-11, Server Serverless 2021-10-18, Server Serverless 2021-10-25, Server Serverless 2021-11-01, Server Serverless 2021-11-15
Participants:

 Description   

ClusterCountCmd::run uses scatterGatherVersionedTargetByRoutingTable, which throws on routing errors. The routing error exceptions are caught in mongos's/mongoq's service entry point, where mongos/mongoq marks the appropriate routing cache entry as stale and retries the command by calling that function recursively until retries are exhausted.
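The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the server's code: StaleRoutingError, RoutingCache, markStale and runWithRoutingRetry are hypothetical names standing in for the real routing exception types, routing cache, and service entry point logic.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a routing error; not the server's real exception type.
struct StaleRoutingError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the router's routing cache.
struct RoutingCache {
    int refreshes = 0;
    // Records that the entry for the namespace must be refreshed before reuse.
    void markStale(const std::string& /*ns*/) { ++refreshes; }
};

// Runs `command`. On a routing error, marks the cache entry stale and calls
// itself again, until `retriesLeft` reaches zero and the error is rethrown.
long long runWithRoutingRetry(const std::function<long long()>& command,
                              RoutingCache& cache,
                              const std::string& ns,
                              int retriesLeft) {
    assert(retriesLeft >= 0);
    try {
        return command();
    } catch (const StaleRoutingError&) {
        if (retriesLeft == 0) {
            throw;  // retries exhausted: surface the routing error to the caller
        }
        cache.markStale(ns);
        return runWithRoutingRetry(command, cache, ns, retriesLeft - 1);
    }
}
```

Under this model, the change proposed in the ticket would amount to treating TenantMigrationAborted like StaleRoutingError in the catch clause.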

This ticket should:

  • Update scatterGatherVersionedTargetByRoutingTable to also throw on TenantMigrationAborted
  • Update mongos's/mongoq's service entry point to recursively call run() on TenantMigrationAborted
  • Add a test that mongos/mongoq retries the count command internally on TenantMigrationAborted.
    • The test should be similar to the testRejectBlockedWritesAfterMigrationAborted test case in that it should:
      • set the pauseTenantMigrationBeforeLeavingBlockingState and abortTenantMigrationBeforeLeavingBlockingState failpoints on the replica set primary
      • start a tenant migration against the replica set primary
      • wait for the first failpoint to be hit
      • run the count command
      • wait for the replica set primary to report that it is blocking a read
      • disable the first failpoint to allow the count command to continue
      • assert that the count command worked
    • However, the test should run the count command against mongos/mongoq instead of against the replica set primary.


 Comments   
Comment by Esha Maharishi (Inactive) [ 01/Nov/21 ]

I misread the code originally: read commands will actually never return TenantMigrationAborted.

The main reason this was confusing was that TenantMigrationDonorAccessBlocker::getCanReadFuture and TenantMigrationDonorAccessBlocker::checkIfCanWrite return different futures that are set differently if the migration aborts:

If the migration is blocking reads, TenantMigrationDonorAccessBlocker::getCanReadFuture returns _transitionOutOfBlockingPromise, which is set with an ok status if the migration aborts.

If the migration is blocking writes, TenantMigrationDonorAccessBlocker::checkIfCanWrite throws TenantMigrationConflict, which is caught in service_entry_point_common.cpp where it waits for _completionPromise, which is set with TenantMigrationAborted if the migration aborts.
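The difference between the two paths can be illustrated with a toy model. This is only a sketch under simplifying assumptions: ToyDonorAccessBlocker, StatusSlot, Code and the two outcome functions are hypothetical, and the real promises are replaced with single-assignment slots so the example needs no threading. Only the abort case described in this comment is modeled.

```cpp
#include <cassert>

// Simplified status codes; only the two outcomes discussed here are modeled.
enum class Code { OK, TenantMigrationAborted };

// Single-assignment slot standing in for a promise/future pair (hypothetical).
struct StatusSlot {
    bool isSet = false;
    Code value = Code::OK;
    void set(Code c) {
        assert(!isSet);  // a promise may only be fulfilled once
        value = c;
        isSet = true;
    }
};

// Toy model of the donor access blocker's two promises.
struct ToyDonorAccessBlocker {
    StatusSlot transitionOutOfBlocking;  // models _transitionOutOfBlockingPromise
    StatusSlot completion;               // models _completionPromise

    // On abort, the two slots are resolved differently.
    void abortMigration() {
        transitionOutOfBlocking.set(Code::OK);
        completion.set(Code::TenantMigrationAborted);
    }
};

// A blocked read waits on the transition-out-of-blocking slot.
Code outcomeForBlockedRead(const ToyDonorAccessBlocker& b) {
    assert(b.transitionOutOfBlocking.isSet);
    return b.transitionOutOfBlocking.value;
}

// A blocked write (after TenantMigrationConflict is caught) waits on the completion slot.
Code outcomeForBlockedWrite(const ToyDonorAccessBlocker& b) {
    assert(b.completion.isSet);
    return b.completion.value;
}
```

In this model a read that was blocked when the migration aborted proceeds with an OK status, while a blocked write observes TenantMigrationAborted, which matches the conclusion that read commands such as count never return that error.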

We are also refactoring the TenantMigrationDonorAccessBlocker::getCanReadFuture function to improve its clarity under SERVER-61114.
