Core Server / SERVER-54460

Resharding may delete the state document before fully completing

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.9
    • Steps To Reproduce:
      1. Change PrimaryOnlyServiceOpObserver::onDelete() in src/mongo/db/repl/primary_only_service_op_observer.cpp to release the instance with an error status:

      service->releaseInstance(
          documentId,
          Status(ErrorCodes::Interrupted,
                 str::stream() << "State document " << documentId << " is dropped",
                 BSON("documentId" << documentId)));

      2. Run the test:

      buildscripts/resmoke.py run --suite=sharding --repeat 1 --mongodSetParameters="{featureFlagTenantMigrations: true}" jstests/sharding/api_params_nontransaction_sharded.js

      It will fail with:

      Error: command failed: {
      ...
      "errmsg" : "State document { _id: UUID(\"2e1c206c-d618-4f8c-ba0f-247637bea29c\") } is dropped",
      ...

    • Sprint:
      Sharding 2021-05-03, Sharding 2021-05-17
    • Story Points:
      2

      Description

      I do not claim that this issue can cause actual production failures, but it was a real problem for me: it blocked me from fully implementing SERVER-53950.

      The idea I was trying to implement in SERVER-53950 was that we should always interrupt the primary-only service instance whenever we unregister it. One of the things that unregisters the instance is the deletion of its state document.

      However, if I wire this up as discussed in that ticket, resharding fails at the moment the state document is deleted, before it completes. I don't see a simple fix myself.
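
      The failure mode described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: ToyStatus, ToyInstance, ToyService and runToyOperation are hypothetical stand-ins invented for illustration, not the real PrimaryOnlyService or resharding classes. It assumes the operation deletes its state document before its last step and that the last step still checks for interruption.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a status value; not mongo::Status.
struct ToyStatus {
    bool ok = true;
    std::string reason;
};

// Hypothetical stand-in for a primary-only service instance.
class ToyInstance {
public:
    // Records the first interruption; later ones are ignored.
    void interrupt(const ToyStatus& status) {
        if (_status.ok) {
            _status = status;
        }
    }
    // Returns an OK status unless the instance was interrupted.
    ToyStatus checkForInterrupt() const {
        return _status;
    }

private:
    ToyStatus _status;
};

// Hypothetical stand-in for the service that tracks instances by state document id.
class ToyService {
public:
    std::shared_ptr<ToyInstance> registerInstance(const std::string& id) {
        auto instance = std::make_shared<ToyInstance>();
        _instances[id] = instance;
        return instance;
    }

    // Models the proposed behavior: unregistering with a non-OK status
    // also interrupts the instance.
    void releaseInstance(const std::string& id, const ToyStatus& status) {
        auto it = _instances.find(id);
        if (it == _instances.end()) {
            return;
        }
        if (!status.ok) {
            it->second->interrupt(status);
        }
        _instances.erase(it);
    }

    // Models the op observer hook that fires when the state document is deleted.
    void onDelete(const std::string& id, bool interruptOnRelease) {
        ToyStatus status;
        if (interruptOnRelease) {
            status.ok = false;
            status.reason = "State document " + id + " is dropped";
        }
        releaseInstance(id, status);
    }

private:
    std::map<std::string, std::shared_ptr<ToyInstance>> _instances;
};

// Models an operation that deletes its state document before its final step.
ToyStatus runToyOperation(ToyService& service, const std::string& id, bool interruptOnRelease) {
    auto instance = service.registerInstance(id);
    // ... main work would happen here ...
    service.onDelete(id, interruptOnRelease);  // state document deleted
    // The remaining cleanup step still observes the interruption.
    return instance->checkForInterrupt();
}
```

      In this model, runToyOperation(service, id, false) finishes with an OK status, while runToyOperation(service, id, true) finishes with the "is dropped" error, because the deletion happens before the operation's final step. It says nothing about how the real resharding code should be reordered.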

              People

              Assignee:
              Cheahuychou Mao (cheahuychou.mao)
              Reporter:
              Andrew Shuvalov (andrew.shuvalov)
              Votes:
              0
              Watchers:
              5
