Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.6, 4.4.0-rc0, 4.0.19, 4.7.0
Affects Version/s: 4.2.3
Component/s: Sharding, Write Ops
Labels: PM-1645-Milestone-1
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0
Sprint: Sharding 2020-04-20
Linked BF Score: 17
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The changes from d591387 as part of SERVER-45100 made it so writes are no longer re-issued to shards we've already received a successful response from.

For updateOne by _id operations, which target all shards that may own data for the collection, this can leave a WriteOp in the WriteOpState_Pending state without any more work being scheduled on the ARS (AsyncRequestsSender). BatchWriteOp::noteBatchResponse() must be called to transition the WriteOp to the WriteOpState_Completed or WriteOpState_Error state, and that call only happens when a response arrives from the ARS. The WriteOp is therefore stranded in WriteOpState_Pending until the numRoundsWithoutProgress counter is exhausted.
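The stranding described above can be illustrated with a toy model. This is a minimal sketch in Python, not the actual mongos C++ code: the `WriteOp` class, `note_response()`, and `run_rounds()` are hypothetical stand-ins, and the limit of 5 rounds with a strict `>` comparison is an assumption inferred from the error text in this ticket ("after 5 rounds ... 6 rounds total").

```python
from enum import Enum, auto


class WriteOpState(Enum):
    READY = auto()
    PENDING = auto()
    COMPLETED = auto()
    ERROR = auto()


# Assumed limit, inferred from the "after 5 rounds" wording of the error.
MAX_ROUNDS_WITHOUT_PROGRESS = 5


class WriteOp:
    """Hypothetical, heavily simplified model of a single write op."""

    def __init__(self, successful_shards=()):
        self.state = WriteOpState.READY
        self.successful_shards = set(successful_shards)

    def note_response(self, shard, ok):
        # Toy stand-in for BatchWriteOp::noteBatchResponse(): the only
        # place where the op can leave the PENDING state.
        if ok:
            self.successful_shards.add(shard)
            self.state = WriteOpState.COMPLETED
        else:
            self.state = WriteOpState.ERROR


def run_rounds(op, targeted_shards):
    """Run retry rounds; return (outcome, total_rounds)."""
    rounds = 0
    rounds_without_progress = 0
    while op.state not in (WriteOpState.COMPLETED, WriteOpState.ERROR):
        op.state = WriteOpState.PENDING
        # Post-SERVER-45100 behaviour: skip shards that already succeeded.
        to_send = [s for s in targeted_shards if s not in op.successful_shards]
        for shard in to_send:
            # In this toy model every dispatched request succeeds.
            op.note_response(shard, ok=True)
        rounds += 1
        if op.state is WriteOpState.PENDING:
            # Nothing was sent, so no response arrived and the op is stuck.
            rounds_without_progress += 1
            if rounds_without_progress > MAX_ROUNDS_WITHOUT_PROGRESS:
                return "NoProgressMade", rounds
    return op.state.name, rounds


# Normal case: the only targeted shard has not answered yet.
print(run_rounds(WriteOp(), ["shard0"]))  # ('COMPLETED', 1)

# Stranded case: the only targeted shard has already succeeded.
print(run_rounds(WriteOp(successful_shards={"shard0"}), ["shard0"]))  # ('NoProgressMade', 6)
```

In the stranded case the model never calls `note_response()`, so the op can only leave the pending state when the round counter gives up.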

1. Collection is sharded with all chunks on shard0 (e.g. when using range-based sharding).
2. Chunk migration from shard0 to shard1 begins (but its commit doesn't complete for the remainder of these steps).
3. UpdateOne by _id operation targets both shard0 and shard1.
4. MongoS receives acknowledgement of successful update from shard0.
5. MongoS receives StaleShardVersion error response from shard1.
6. MongoS re-targets and only considers resending the updateOne by _id operation to shard0 because shard1 doesn't yet own any chunks.
7. MongoS doesn't resend the updateOne by _id to shard0 because it has already received a successful response from shard0. The WriteOp is put into the WriteOpState_Pending state without there being any more requests to send.
8. MongoS reports a write result with a NoProgressMade error:

{
  ok: 1,
  nModified: 1,
  n: 1,
  writeErrors: [
    {
      index: 0,
      code: 82,
      codeName: "NoProgressMade",
      errmsg: "no progress was made executing batch write op in test2_fsmdb0.fsmcoll0 after 5 rounds (0 ops completed in 6 rounds total)"
    }
  ]
}
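The re-targeting in the steps above comes down to a set difference that ends up empty. The following is an illustrative sketch only: `shards_to_resend()` and the dictionaries are hypothetical names chosen for this example and do not correspond to mongos internals.

```python
def shards_to_resend(chunk_owners, already_successful):
    """Shards a retry would contact: current chunk owners minus
    the shards that have already acknowledged the write."""
    return set(chunk_owners) - set(already_successful)


# First attempt: both shards were targeted and answered as in the steps above.
responses = {"shard0": "ok", "shard1": "StaleShardVersion"}
successful = {shard for shard, result in responses.items() if result == "ok"}

# The migration has not committed, so the refreshed routing information
# still places every chunk on shard0.
chunk_owners = {"shard0"}

retry_targets = shards_to_resend(chunk_owners, successful)
print(retry_targets)  # set()
```

With no retry targets there is nothing to dispatch, which matches the "0 ops completed" part of the error message.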

is caused by: SERVER-45100 Make the BatchWriteExecutor retry multi-writes only against unsuccessful shards (Closed)

is depended on by: SERVER-32198 Missing collection metadata on the shard implies both UNSHARDED and "metadata not loaded yet" (Closed)

Assignee: Marcos José Grillo Ramirez
Reporter: Max Hirschhorn
Participants: Githook User, Marcos José Grillo Ramirez, Max Hirschhorn
Votes: 0
Watchers: 5

Created: Apr 01 2020 06:35:14 PM UTC
Updated: Oct 29 2023 10:10:07 PM UTC
Resolved: Apr 07 2020 03:43:46 PM UTC
Confidence Status Last Update: 02/Apr/20 9:43 PM
