[SERVER-83829] Mongos can incorrectly report WCE after a retry of WriteWithoutShardKeyWithID writes despite succeeding in second try Created: 02/Dec/23  Updated: 10/Jan/24  Resolved: 10/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Abdul Qadeer Assignee: Abdul Qadeer
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Cluster Scalability 2024-1-8, Cluster Scalability 2024-1-22
Participants:

 Description   

Under PM-3190, as reported by kaitlin.mahar@mongodb.com verbatim:

  1. we have two shards, shardA and shardB, and send the write to them both
  2. shardA responds {ok: 1, n:0, writeConcernError: {…}} and we save that write concern error
  3. shardB responds with a StaleConfig error for the write.
  4. given the StaleConfig error, we consider the response we got from shardA in step 2 to be out of date/invalidated, and so we retarget all shards anew.
  5. after retargeting, say we get {ok:1, n:1} back from shardA and {ok:1, n:0} back from shardB

  6. the final response, perhaps confusingly, is still going to contain the write concern error from step 2, even though we consider that write irrelevant at this point and ended up redoing it (and didn’t get a WC error the second time.)
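
The six steps above can be modeled with a small toy simulation. This is only a sketch of the reported behavior, not the mongos implementation: ShardReply, run_broadcast_buggy, and the "wtimeout" string are hypothetical names chosen for illustration, and the one assumption carried over from the report is that the saved write concern error is not discarded when a StaleConfig error forces retargeting.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ShardReply:
    """Hypothetical stand-in for one shard's response to the broadcast write."""
    n: int = 0
    write_concern_error: Optional[str] = None
    stale_config: bool = False


def run_broadcast_buggy(attempts: List[Dict[str, ShardReply]]) -> dict:
    """Replay a list of targeting attempts, one {shard: reply} dict per attempt."""
    # Modeled bug: this list outlives a retry, so errors from an
    # invalidated attempt are still present in the final response.
    saved_wce: List[Tuple[str, str]] = []
    for replies in attempts:
        n_total = 0
        must_retry = False
        for shard, reply in replies.items():
            if reply.stale_config:
                must_retry = True  # step 3: invalidates the whole attempt
                continue
            n_total += reply.n
            if reply.write_concern_error is not None:
                saved_wce.append((shard, reply.write_concern_error))  # step 2
        if not must_retry:
            return {"ok": 1, "n": n_total, "writeConcernErrors": saved_wce}
    raise RuntimeError("no attempt completed without a StaleConfig error")


# Steps 1-6 from the description.
scenario = [
    {"shardA": ShardReply(n=0, write_concern_error="wtimeout"),
     "shardB": ShardReply(stale_config=True)},
    {"shardA": ShardReply(n=1), "shardB": ShardReply(n=0)},
]
result = run_broadcast_buggy(scenario)
```

In this model, result still carries the write concern error recorded from shardA in the first attempt, even though the second attempt reported none.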

 



 Comments   
Comment by Githook User [ 10/Jan/24 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-83829 Defer WCE errors for WriteWithoutShardKeyWithId (#17930)

GitOrigin-RevId: 2a53d7cb226d1912f154aa59563759148795f28a
Branch: master
https://github.com/mongodb/mongo/commit/04f19bb61aba10577658947095020f00ac1403c4

Comment by Kaitlin Mahar [ 04/Dec/23 ]

I am fixing this on the bulkWrite codepath in SERVER-83463; however, a similar fix will be needed in the batch_write_exec codepath taken by the update and delete commands.

Comment by Abdul Qadeer [ 02/Dec/23 ]

This scenario is possible for no-op writes. A solution would be to mark the WCE in a broadcast and later clear it if any further retry has to happen for the write.
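
A minimal sketch of the proposed approach, in Python rather than the server's C++: write concern errors are held as deferred for the current attempt and dropped if the write has to be retried. The function name execute_with_deferred_wce and the reply dictionaries are assumptions made for illustration and do not reflect the actual batch_write_exec code.

```python
from typing import Dict, List, Tuple


def execute_with_deferred_wce(attempts: List[Dict[str, dict]]) -> dict:
    """Replay targeting attempts; each maps a shard name to a reply dict.

    Hypothetical reply shapes: {"n": 1}, {"n": 0, "wce": "..."}, {"stale": True}.
    """
    for replies in attempts:
        # Deferred errors belong to this attempt only.
        deferred: List[Tuple[str, str]] = []
        n_total = 0
        must_retry = False
        for shard, reply in replies.items():
            if reply.get("stale"):
                must_retry = True
                break
            n_total += reply.get("n", 0)
            if reply.get("wce") is not None:
                deferred.append((shard, reply["wce"]))
        if must_retry:
            # Retrying: the deferred errors are cleared with the attempt.
            continue
        response = {"ok": 1, "n": n_total}
        if deferred:
            response["writeConcernErrors"] = deferred
        return response
    raise RuntimeError("no attempt completed without a stale-routing error")


retried = execute_with_deferred_wce([
    {"shardA": {"n": 0, "wce": "wtimeout"}, "shardB": {"stale": True}},
    {"shardA": {"n": 1}, "shardB": {"n": 0}},
])
not_retried = execute_with_deferred_wce([
    {"shardA": {"n": 1, "wce": "wtimeout"}, "shardB": {"n": 0}},
])
```

Under these assumptions, retried contains no write concern error because the first attempt was discarded, while not_retried still reports the error from the attempt that actually produced the final result.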
