[SERVER-47233] WriteOp can be left in pending state, leading to erroneous NoProgressMade write error from mongos
Created: 01/Apr/20  Updated: 29/Oct/23  Resolved: 07/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding, Write Ops
Affects Version/s: 4.2.3
Fix Version/s: 4.2.6, 4.4.0-rc0, 4.0.19, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Marcos José Grillo Ramirez
Resolution: Fixed
Votes: 0
Labels: PM-1645-Milestone-1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-32198 Missing collection metadata on the sh... Closed
Problem/Incident
is caused by SERVER-45100 Make the BatchWriteExecutor retry mul... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0
Sprint: Sharding 2020-04-20
Participants:
Linked BF Score: 17

 Description   

The changes from d591387 as part of SERVER-45100 made it so writes are no longer re-issued to shards we've already received a successful response from.

For updateOne by _id operations which target all shards that may own data for the collection, this can lead to a WriteOp being put in the WriteOpState_Pending state without scheduling more work on the ARS. Since BatchWriteOp::noteBatchResponse() must be called to transition the WriteOp to the WriteOpState_Completed or WriteOpState_Error states and that only happens when getting a response from the ARS, the WriteOp is left stranded in the WriteOpState_Pending state until it exhausts the numRoundsWithoutProgress counter.

  1. Collection is sharded with all chunks on shard0. (E.g. when using range-based sharding)
  2. Chunk migration from shard0 to shard1 begins (but committing it doesn't complete for the remainder of these steps).
  3. UpdateOne by _id operation targets both shard0 and shard1.
  4. MongoS receives acknowledgement of successful update from shard0.
  5. MongoS receives StaleShardVersion error response from shard1.
  6. MongoS re-targets and only considers resending updateOne by _id operation to shard0 because shard1 doesn't yet own any chunks.
  7. MongoS doesn't resend the updateOne by _id to shard0 because it has already got a successful response from shard0. WriteOp is put into the WriteOpState_Pending state without there being any more requests to send.
  8. MongoS reports a write result with a NoProgressMade error:

{
  ok: 1,
  nModified: 1,
  n: 1,
  writeErrors: [
    {
      index: 0,
      code: 82,
      codeName: "NoProgressMade",
      errmsg: "no progress was made executing batch write op in test2_fsmdb0.fsmcoll0 after 5 rounds (0 ops completed in 6 rounds total)"
    }
  ]
}
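
The steps above can be sketched as a toy state machine. This is a minimal illustration and not the mongos implementation: the Python names (WriteOp, run_update_by_id, owners_by_round, reply_from, complete_when_no_targets) are hypothetical, the ARS is not modeled, and the round-counting rule is a simplification chosen so that the limit of 5 matches the error message above. The complete_when_no_targets flag reflects one reading of the fix commit's title, "Consider WriteOp complete when no shards left to target".

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class WriteOpState(Enum):
    # Toy stand-ins for the WriteOpState_* values named in the description.
    READY = auto()
    PENDING = auto()
    COMPLETED = auto()
    ERROR = auto()


# Taken from "after 5 rounds" in the error message; the counting rule
# below is an assumption, not the server's exact logic.
MAX_ROUNDS_WITHOUT_PROGRESS = 5


@dataclass
class WriteOp:
    state: WriteOpState = WriteOpState.READY
    acked_shards: set = field(default_factory=set)


def run_update_by_id(owners_by_round, reply_from, complete_when_no_targets):
    """Simulate one updateOne-by-_id write.

    owners_by_round: list of shard sets the router would target in each
        round (the last entry is reused once the list is exhausted).
    reply_from: shard name -> "ok" or "stale" (hypothetical reply labels).
    complete_when_no_targets: if True, apply the assumed fix.
    Returns (final state, total rounds).
    """
    op = WriteOp()
    rounds = 0
    rounds_without_progress = 0
    terminal = (WriteOpState.COMPLETED, WriteOpState.ERROR)
    while op.state not in terminal:
        owners = owners_by_round[min(rounds, len(owners_by_round) - 1)]
        rounds += 1
        # Writes are not re-issued to shards that already succeeded.
        targets = owners - op.acked_shards
        if not targets:
            if complete_when_no_targets:
                op.state = WriteOpState.COMPLETED
                break
            # Nothing is sent, so no response can ever move the op
            # out of the pending state.
            op.state = WriteOpState.PENDING
            rounds_without_progress += 1
            if rounds_without_progress >= MAX_ROUNDS_WITHOUT_PROGRESS:
                op.state = WriteOpState.ERROR  # NoProgressMade
            continue
        op.state = WriteOpState.PENDING
        replies = {shard: reply_from[shard] for shard in targets}
        op.acked_shards |= {s for s, r in replies.items() if r == "ok"}
        if all(r == "ok" for r in replies.values()):
            op.state = WriteOpState.COMPLETED
        else:
            op.state = WriteOpState.READY  # stale reply: re-target next round
    return op.state, rounds


# Steps 3-7: round 1 targets both shards, later rounds only shard0.
owners = [{"shard0", "shard1"}, {"shard0"}]
replies = {"shard0": "ok", "shard1": "stale"}
stuck = run_update_by_id(owners, replies, complete_when_no_targets=False)
fixed = run_update_by_id(owners, replies, complete_when_no_targets=True)
print(stuck[0].name, stuck[1])  # ERROR 6
print(fixed[0].name, fixed[1])  # COMPLETED 2
```

In this sketch the write without the flag is acknowledged by shard0 in round 1 and then spins through five empty rounds before erroring, which mirrors the "0 ops completed in 6 rounds total" message even though the update was actually applied (nModified: 1).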



 Comments   
Comment by Githook User [ 08/Apr/20 ]

Author:

{'name': 'Marcos José Grillo Ramírez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-47233 Consider WriteOp complete when no shards left to target

(cherry picked from commit 2f6cac5d4b98969621b48decd2cadaed9e8faf14)
(cherry picked from commit 03a97185d0900fe46e86d404cb172f63644ef4c3)
(cherry picked from commit e1ac3e270cef236e9ff8630c8c388e0ea7dc2127)
Branch: v4.0
https://github.com/mongodb/mongo/commit/5a9f699eb3adced1fd4bc15ce8028d30bfe17fca

Comment by Githook User [ 08/Apr/20 ]

Author:

{'name': 'Marcos José Grillo Ramírez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-47233 Consider WriteOp complete when no shards left to target

(cherry picked from commit 2f6cac5d4b98969621b48decd2cadaed9e8faf14)
(cherry picked from commit 03a97185d0900fe46e86d404cb172f63644ef4c3)
Branch: v4.2
https://github.com/mongodb/mongo/commit/e1ac3e270cef236e9ff8630c8c388e0ea7dc2127

Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Marcos José Grillo Ramírez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-47233 Consider WriteOp complete when no shards left to target

(cherry picked from commit 2f6cac5d4b98969621b48decd2cadaed9e8faf14)
Branch: v4.4
https://github.com/mongodb/mongo/commit/03a97185d0900fe46e86d404cb172f63644ef4c3

Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Marcos José Grillo Ramírez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-47233 Consider WriteOp complete when no shards left to target
Branch: master
https://github.com/mongodb/mongo/commit/2f6cac5d4b98969621b48decd2cadaed9e8faf14

Comment by Marcos José Grillo Ramirez [ 02/Apr/20 ]

The assurances of the batch write executor were the following before SERVER-45100:

  • If there are no migrations then all write operations are executed exactly once on a sharded cluster
  • If a migration occurs then all write operations can occur 0, 1 or more than once on each shard within a sharded cluster

With SERVER-45100 we intended to preserve the first assurance in the presence of a StaleShardVersion (SSV) error due to unknown metadata, but unfortunately we're causing an error where we shouldn't. We'll be adding a fix for this. I'm adding garaudy.etienne to the loop because these assurances should be known by product. Thanks to max.hirschhorn for investigating and delimiting this and kaloian.manassiev for helping me reason about this bug.

Generated at Thu Feb 08 05:13:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.