[SERVER-72413] Split may return an error when it is actually committed Created: 28/Dec/22  Updated: 29/Oct/23  Resolved: 29/Dec/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.1.0-rc4, 6.2.0-rc4
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Silvia Surroca Assignee: Silvia Surroca
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-65838 Remove applyOpsDeprecated usage from ... Closed
Related
related to SERVER-71649 Transaction API shouldn't block on an... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.2
Sprint: Sharding EMEA 2022-12-26
Participants:
Linked BF Score: 147

 Description   

A split may return an error even though it was actually committed. This happens when the command _configsvrCommitChunkSplit commits the split but then fails while waiting for majority.

Here is an example of when this can happen:

1) A split is requested from mongos, which forwards it to the shard that owns the chunk.
2) The shard issues a _configsvrCommitChunkSplit to the config server.
3) The config server steps down after the split has been committed but while it is still waiting for the majority write.
4) The shard receives an InterruptedDueToReplStateChange error from the config server and, after refreshing its cache, checks whether the split was actually committed.
5) Because the refresh may be done against a config server node that does not yet have the latest chunk version, the shard still believes that the split was not done and retries the _configsvrCommitChunkSplit command.
6) The new config server primary node fails here on this query because the split was actually done.
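
The sequence above can be modelled with a small toy simulation. This is only a sketch: the class ConfigNode, the function split_from_shard, the exception SplitPreconditionFailed and the integer chunk ranges are hypothetical names invented for illustration and do not correspond to the server code.

```python
from dataclasses import dataclass, field


class SplitPreconditionFailed(Exception):
    """Raised when the chunk to split is no longer present in the catalog."""


class InterruptedDueToReplStateChange(Exception):
    """Raised when the primary steps down before the majority wait ends."""


@dataclass
class ConfigNode:
    # Chunk ranges as (min, max) pairs; a hypothetical stand-in for the catalog.
    chunks: list = field(default_factory=lambda: [(0, 100)])

    def commit_chunk_split(self, chunk, split_point, step_down_before_majority=False):
        if chunk not in self.chunks:
            # Step 6: the range was already split, so the precondition fails.
            raise SplitPreconditionFailed(f"chunk {chunk} not found")
        lo, hi = chunk
        self.chunks.remove(chunk)
        self.chunks += [(lo, split_point), (split_point, hi)]
        if step_down_before_majority:
            # Step 3: the split is committed, but the majority wait is interrupted.
            raise InterruptedDueToReplStateChange()


def split_from_shard(old_primary, stale_node, new_primary, chunk, split_point):
    """Model the shard side of steps 2 to 6."""
    try:
        old_primary.commit_chunk_split(chunk, split_point, step_down_before_majority=True)
        return "ok"
    except InterruptedDueToReplStateChange:
        pass
    # Steps 4 and 5: refresh from a node that has not yet seen the split.
    if chunk in stale_node.chunks:
        # The shard concludes the split was not done and retries the commit.
        try:
            new_primary.commit_chunk_split(chunk, split_point)
        except SplitPreconditionFailed as exc:
            return f"error: {exc}"
    return "ok"


old_primary = ConfigNode()
stale_node = ConfigNode()                           # still has the unsplit chunk
new_primary = ConfigNode(chunks=[(0, 50), (50, 100)])  # already has the split
result = split_from_shard(old_primary, stale_node, new_primary, (0, 100), 50)
```

In this model result is an error string even though old_primary.chunks already contains both halves of the split, which mirrors the symptom reported in this ticket.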



 Comments   
Comment by Githook User [ 29/Dec/22 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-72413 Split may return an error when it is actually committed
Branch: master
https://github.com/mongodb/mongo/commit/e836af153c045bee380646c9c8f3715cabfe73ed

Comment by Silvia Surroca [ 29/Dec/22 ]

Requesting backport to v6.2 since the bug was introduced by SERVER-65838 in v6.1 without backports.

Comment by Silvia Surroca [ 29/Dec/22 ]

The issue was caused by SERVER-65838 because, moving from applyOps to internal transactions, we stopped updating to the lastOpTime when the split precondition fails. applyOps was implicitly updating to the lastOpTime of the opCtx, while the transaction doesn't if it fails.
SERVER-65838 was introduced in v6.1.

However, the error started appearing after SERVER-71649 because that change reduces the time spent waiting for a transaction abort.
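
The lastOpTime difference described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not the server implementation: Client, ConfigPrimary, fail_precondition and the integer op times are hypothetical, and the flag advance_on_failure merely stands in for the applyOps-style behavior (True) versus the failed-transaction behavior (False).

```python
from dataclasses import dataclass


@dataclass
class Client:
    # Last op time this client has observed (hypothetical integer clock).
    last_op_time: int = 0


@dataclass
class ConfigPrimary:
    system_last_op_time: int = 0         # newest op time written on this node
    majority_committed_op_time: int = 0  # newest op time acknowledged by a majority

    def must_wait_for_majority(self, client):
        # In this model the majority wait does nothing unless the client's
        # op time is ahead of the majority commit point.
        return client.last_op_time > self.majority_committed_op_time


def fail_precondition(primary, client, advance_on_failure):
    """Model a split whose precondition fails because an earlier attempt won."""
    if advance_on_failure:
        # applyOps-like behavior: the client's op time is moved to the latest
        # op time even though this attempt wrote nothing.
        client.last_op_time = primary.system_last_op_time
    return primary.must_wait_for_majority(client)


# The earlier split was written at op time 7 but only op time 6 is majority committed.
primary = ConfigPrimary(system_last_op_time=7, majority_committed_op_time=6)
waits_with_advance = fail_precondition(primary, Client(), advance_on_failure=True)
waits_without_advance = fail_precondition(primary, Client(), advance_on_failure=False)
```

In this model only the variant that advances the op time on failure ends up waiting for the earlier split to become majority committed; the other returns immediately.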

Generated at Thu Feb 08 06:21:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.