[SERVER-61444] Resharding uses of bumpCollectionVersionAndChangeMetadataInTxn are not idempotent Created: 12/Nov/21  Updated: 29/Oct/23  Resolved: 08/Feb/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.3.0, 5.2.1, 5.0.7

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: Brett Nawrocki
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-67457 Resharding operation aborted in the m... Closed
related to SERVER-62072 _configsvrReshardCollection may retur... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.2, v5.0
Sprint: Sharding 2022-01-10, Sharding 2022-01-24, Sharding 2022-02-07, Sharding 2022-05-02
Participants:
Linked BF Score: 36
Story Points: 4

 Description   

The resharding usages of bumpCollectionVersionAndChangeMetadataInTxn are not idempotent because they assume that if an error occurs while running the function, the transaction will have been aborted, so a retry can start from a clean slate. However, there is an edge case in which the commit succeeds but the wait for write concern is interrupted. In that case the function can assert even though the changes were committed successfully.
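
The edge case can be illustrated with a minimal sketch. This is a toy model, not MongoDB code: `FakeConfigServer`, `bump_version_in_txn`, `bump_with_naive_retry`, and `WriteConcernInterrupted` are hypothetical names invented for the example. It assumes only what the description states, namely that the commit can succeed while the subsequent write concern wait fails, and that the caller retries on any error.

```python
class WriteConcernInterrupted(Exception):
    """Raised after a successful commit when the replication wait is interrupted."""


class FakeConfigServer:
    """Hypothetical stand-in for the durable metadata; not a MongoDB API."""

    def __init__(self, interrupt_first_wait):
        self.collection_version = 1
        self.interrupt_next_wait = interrupt_first_wait

    def bump_version_in_txn(self):
        # The transaction commits: the change is durable from here on.
        self.collection_version += 1
        # The write concern wait happens after the commit and can still fail.
        if self.interrupt_next_wait:
            self.interrupt_next_wait = False
            raise WriteConcernInterrupted("interrupted while waiting for replication")


def bump_with_naive_retry(server, max_attempts=3):
    """Retries on any error, assuming a failed attempt left no changes behind."""
    for _ in range(max_attempts):
        try:
            server.bump_version_in_txn()
            return
        except WriteConcernInterrupted:
            continue
    raise RuntimeError("retries exhausted")


server = FakeConfigServer(interrupt_first_wait=True)
bump_with_naive_retry(server)
print(server.collection_version)  # 3: one logical bump was applied twice
```

In this model the caller asked for one bump, but the version moved from 1 to 3 because the first attempt had already committed when the error surfaced.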



 Comments   
Comment by Githook User [ 09/Feb/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-61444 Resharding coordinator state transactions now use w:1

Prior to updating its own in-memory state, the resharding coordinator
first runs a transaction to persist that state. There is an edge case
where that transaction (if run with >w:1) will commit successfully, but
become interrupted while waiting for replication. If that happens, the
coordinator will have completed the transaction's work, but fail to
update its own in-memory state, and therefore will redo that work when
it retries after handling the exception. Instead of running with the
default of w:majority, the transactions for these state transitions
have therefore been changed to use w:1 in order to avoid the
interruption edge case. An explicit wait for majority is added after
the transactions in cases where it must be majority committed before
proceeding.

(cherry picked from commit a710a2bf41118b848976502839590b66993bf512)
(cherry picked from commit 4e8f9344927e440f93681852e18b33319107f8f1)
Branch: v5.0
https://github.com/mongodb/mongo/commit/8d95e69b997f79cac5e6cfc86686fd4d8cb44b02

Comment by Githook User [ 09/Feb/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-61444 Resharding coordinator state transactions now use w:1

Prior to updating its own in-memory state, the resharding coordinator
first runs a transaction to persist that state. There is an edge case
where that transaction (if run with >w:1) will commit successfully, but
become interrupted while waiting for replication. If that happens, the
coordinator will have completed the transaction's work, but fail to
update its own in-memory state, and therefore will redo that work when
it retries after handling the exception. Instead of running with the
default of w:majority, the transactions for these state transitions
have therefore been changed to use w:1 in order to avoid the
interruption edge case. An explicit wait for majority is added after
the transactions in cases where it must be majority committed before
proceeding.

(cherry picked from commit a710a2bf41118b848976502839590b66993bf512)
(cherry picked from commit 4e8f9344927e440f93681852e18b33319107f8f1)
Branch: v5.2
https://github.com/mongodb/mongo/commit/3aaf548878d03e59b4006d352396060cfebc43f0

Comment by Githook User [ 08/Feb/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-61444 Fix race in resharding coordinator service unit test

Changes in the previous commit for SERVER-61444 refactored
resharding_coordinator_service_test.cpp to use functions for behavior
common between test cases. In doing so, the creation order for the
ReshardingCoordinatorService and the PauseDuringStateTransitions guard
was reversed. If the guard is created after the service, it is possible
that the service progresses through a state transition without the guard
having observed it. If this happens, when waiting for that state, the
guard will hang indefinitely instead of returning immediately. The guard
now is once again created before the service to resolve this issue.
Branch: master
https://github.com/mongodb/mongo/commit/4e8f9344927e440f93681852e18b33319107f8f1
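
The ordering problem described in this commit message can be sketched with a simplified model. All names here (`TransitionRegistry`, `PauseGuard`, `Service`) are hypothetical and only loosely mirror the test's PauseDuringStateTransitions guard and ReshardingCoordinatorService. The sketch assumes the service may transition as soon as it is created, and it uses a short timeout in place of the indefinite hang.

```python
import threading


class TransitionRegistry:
    """Hypothetical hook point that a service notifies on each state transition."""

    def __init__(self):
        self._guards = []

    def register(self, guard):
        self._guards.append(guard)

    def notify(self, state):
        for guard in self._guards:
            guard.on_transition(state)


class PauseGuard:
    """Observes transitions to one state, starting from the moment it is created."""

    def __init__(self, registry, state):
        self._state = state
        self._seen = threading.Event()
        registry.register(self)

    def on_transition(self, state):
        if state == self._state:
            self._seen.set()

    def wait(self, timeout):
        # Returns True if the state was observed, False if the wait timed out.
        return self._seen.wait(timeout)


class Service:
    def __init__(self, registry):
        # Assumption: the service can begin transitioning right after creation.
        registry.notify("initializing")


def guard_first():
    registry = TransitionRegistry()
    guard = PauseGuard(registry, "initializing")
    Service(registry)
    return guard.wait(timeout=0.05)


def service_first():
    registry = TransitionRegistry()
    Service(registry)
    guard = PauseGuard(registry, "initializing")
    return guard.wait(timeout=0.05)


print(guard_first())    # True: the guard observed the transition
print(service_first())  # False: the transition happened before the guard existed
```

With the guard created first, the wait returns immediately. With the order reversed, the transition is never observed, which in the real test shows up as a hang rather than a timeout.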

Comment by Githook User [ 24/Jan/22 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-61444 Resharding coordinator state transactions now use w:1

Prior to updating its own in-memory state, the resharding coordinator
first runs a transaction to persist that state. There is an edge case
where that transaction (if run with >w:1) will commit successfully, but
become interrupted while waiting for replication. If that happens, the
coordinator will have completed the transaction's work, but fail to
update its own in-memory state, and therefore will redo that work when
it retries after handling the exception. Instead of running with the
default of w:majority, the transactions for these state transitions
have therefore been changed to use w:1 in order to avoid the
interruption edge case. An explicit wait for majority is added after
the transactions in cases where it must be majority committed before
proceeding.
Branch: master
https://github.com/mongodb/mongo/commit/a710a2bf41118b848976502839590b66993bf512
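
A rough sketch of the approach in this commit message, under stated assumptions: the transaction runs with w:1, and the majority wait is a separate step that can be retried without redoing the transaction. `CoordinatorSketch`, `transition_to`, and `transition_with_retry` are hypothetical names, and updating the in-memory state immediately after the commit is an assumption of this sketch, not something the commit message spells out.

```python
class Interrupted(Exception):
    """Stands in for an interruption while waiting for replication."""


class CoordinatorSketch:
    """Hypothetical model of the coordinator; none of these names are MongoDB APIs."""

    def __init__(self, interrupt_first_wait):
        self.persisted_transitions = []  # work done by committed transactions
        self.in_memory_state = None
        self._interrupt_next_wait = interrupt_first_wait

    def _run_transaction_w1(self, state):
        # With w:1 the commit is acknowledged by the primary alone, so no
        # replication wait can be interrupted inside this step.
        self.persisted_transitions.append(state)

    def _wait_for_majority(self):
        # Separate step: interrupting it leaves the earlier commit untouched.
        if self._interrupt_next_wait:
            self._interrupt_next_wait = False
            raise Interrupted("interrupted while waiting for majority")

    def transition_to(self, state):
        if self.in_memory_state != state:
            self._run_transaction_w1(state)
            self.in_memory_state = state  # assumed to follow the commit directly
        self._wait_for_majority()


def transition_with_retry(coordinator, state, max_attempts=3):
    for _ in range(max_attempts):
        try:
            coordinator.transition_to(state)
            return
        except Interrupted:
            continue
    raise RuntimeError("retries exhausted")


coordinator = CoordinatorSketch(interrupt_first_wait=True)
transition_with_retry(coordinator, "cloning")
print(coordinator.persisted_transitions)  # ['cloning']
```

In this model the interrupted majority wait is retried on its own, so the transaction's work is recorded once even though the overall transition needed two attempts.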
