  1. Core Server
  2. SERVER-47233

WriteOp can be left in pending state, leading to erroneous NoProgressMade write error from mongos

    • Fully Compatible
    • ALL
    • v4.4, v4.2, v4.0
    • Sharding 2020-04-20
    • 17

      The changes from d591387 as part of SERVER-45100 made it so writes are no longer re-issued to shards we've already received a successful response from.

      For updateOne by _id operations that target all shards that may own data for the collection, this can leave a WriteOp in the WriteOpState_Pending state with no further work scheduled on the ARS (AsyncRequestsSender). BatchWriteOp::noteBatchResponse() must be called to transition the WriteOp to the WriteOpState_Completed or WriteOpState_Error state, and that call only happens when a response arrives from the ARS. With no request outstanding, no response can arrive, so the WriteOp stays stranded in WriteOpState_Pending until the numRoundsWithoutProgress counter reaches its limit and mongos reports a NoProgressMade write error.
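The stranded state described above can be illustrated with a small model. This is a simplified sketch, not the server's C++ implementation: the `WriteOp` class, its `target` and `note_response` methods, and the `is_stranded` helper are hypothetical names introduced only for illustration, and the real code handles errors (including stale-version retargeting) in more detail.

```python
from enum import Enum, auto


class WriteOpState(Enum):
    READY = auto()
    PENDING = auto()
    COMPLETED = auto()
    ERROR = auto()


class WriteOp:
    """Hypothetical, simplified stand-in for the mongos WriteOp state machine."""

    def __init__(self):
        self.state = WriteOpState.READY
        self.outstanding = set()  # shards we are still waiting to hear from

    def target(self, shards_to_send):
        # Models the problematic behaviour: the op becomes PENDING even when
        # the list of shards to send to is empty.
        self.outstanding = set(shards_to_send)
        self.state = WriteOpState.PENDING

    def note_response(self, shard, ok):
        # The only way out of PENDING is a response from a shard.
        self.outstanding.discard(shard)
        if not ok:
            self.state = WriteOpState.ERROR
        elif not self.outstanding:
            self.state = WriteOpState.COMPLETED

    def is_stranded(self):
        # PENDING with nothing in flight: note_response() can never be called.
        return self.state is WriteOpState.PENDING and not self.outstanding


normal = WriteOp()
normal.target(["shard0"])
normal.note_response("shard0", ok=True)

stuck = WriteOp()
stuck.target([])  # every targeted shard has already succeeded
```

In this model `normal` ends in `COMPLETED`, while `stuck` remains `PENDING` with an empty `outstanding` set, which is the condition the ticket describes.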

      1. Collection is sharded with all chunks on shard0 (e.g. when using range-based sharding).
      2. A chunk migration from shard0 to shard1 begins, but its commit does not complete during the remaining steps.
      3. UpdateOne by _id operation targets both shard0 and shard1.
      4. MongoS receives acknowledgement of successful update from shard0.
      5. MongoS receives StaleShardVersion error response from shard1.
      6. MongoS re-targets and only considers resending updateOne by _id operation to shard0 because shard1 doesn't yet own any chunks.
      7. MongoS doesn't resend the updateOne by _id to shard0 because it has already received a successful response from shard0. The WriteOp is put into the WriteOpState_Pending state even though there are no more requests to send.
      8. MongoS reports a write result with a NoProgressMade error:
      {
        ok: 1,
        nModified: 1,
        n: 1,
        writeErrors: [
          {
            index: 0,
            code: 82,
            codeName: "NoProgressMade",
            errmsg: "no progress was made executing batch write op in test2_fsmdb0.fsmcoll0 after 5 rounds (0 ops completed in 6 rounds total)"
          }
        ]
      }
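The steps above can be sketched as a toy simulation of the mongos retry loop. This is a sketch under stated assumptions rather than the actual batch write executor: the function name `simulate_update_one_by_id` is hypothetical, and both the limit of 5 rounds and the choice to treat a routing refresh as progress are assumptions picked so that the round counts line up with the error message above.

```python
def simulate_update_one_by_id(max_rounds_without_progress=5):
    """Toy model of steps 1-8; returns (completed_ops, total_rounds, error)."""
    # Fixed replies for this scenario: shard0 applies the update (step 4),
    # shard1 rejects it because the migration has not committed (step 5).
    replies = {"shard0": "ok", "shard1": "StaleShardVersion"}
    acked = set()                    # shards that already returned success
    targets = ["shard0", "shard1"]   # step 3: both shards may own the document
    completed_ops = 0
    total_rounds = 0
    rounds_without_progress = 0

    while completed_ops == 0:
        total_rounds += 1
        # Never re-send to a shard that already succeeded (SERVER-45100 change).
        to_send = [s for s in targets if s not in acked]
        saw_stale = False
        for shard in to_send:
            if replies[shard] == "ok":
                acked.add(shard)
            else:
                saw_stale = True
        if saw_stale:
            # Step 6: after the refresh, shard1 still owns no chunks.
            targets = ["shard0"]
            rounds_without_progress = 0  # assumption: a refresh counts as progress
            continue
        if to_send:
            completed_ops = 1            # only a response can complete the op
            break
        # Step 7: nothing to send, so no response can ever arrive.
        rounds_without_progress += 1
        if rounds_without_progress >= max_rounds_without_progress:
            return completed_ops, total_rounds, "NoProgressMade"  # step 8
    return completed_ops, total_rounds, None


result = simulate_update_one_by_id()
print(result)  # (0, 6, 'NoProgressMade')
```

Round 1 sends to both shards and triggers the refresh; rounds 2 through 6 have nothing to send, so the model gives up after 5 rounds without progress and 6 rounds in total, with 0 ops completed even though shard0 applied the update.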
      

            Assignee:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
