[SERVER-74647] Resharding state machine creation should be retried after interruption Created: 06/Mar/23  Updated: 29/Oct/23  Resolved: 29/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.15, 6.2.0-rc6, 6.0.5, 6.2.1, 6.3.0-rc2
Fix Version/s: 7.0.0-rc0, 6.0.6, 5.0.17

Type: Bug Priority: Major - P3
Reporter: Tommaso Tocci Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.3, v6.2, v6.0, v5.0
Participants:
Linked BF Score: 105

 Description   

createReshardingStateMachine is the function in charge of:

  1. Writing the state machine document on disk
  2. Creating the related POS instance

Unfortunately, this function is not idempotent. If the opCtx is interrupted between (1.) and (2.) during the first execution, subsequent executions will retry (1.), fail with a DuplicateKey error, and never execute (2.). In this scenario the state machine document is written on disk, but the POS instance for the recipient/donor is never installed or run, leaving the resharding operation in a hung state.

createReshardingStateMachine is called as part of the shard version recovery procedure. The operation context of this procedure is interrupted every time a thread enters the collection critical section, for instance as part of a chunk migration.

One possible solution would be to attempt the creation of the POS instance even when the insertion fails with a DuplicateKey error.
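
The failure mode and the proposed fix can be illustrated with a small toy model. This is only a sketch in Python, not the server's C++ code: the Shard class, the exception types, and every function name below are hypothetical stand-ins, and it assumes that installing the POS instance has get-or-create semantics, so repeating it is harmless.

```python
class DuplicateKey(Exception):
    """Hypothetical stand-in for the server's DuplicateKey error."""


class Interrupted(Exception):
    """Hypothetical stand-in for an opCtx interruption."""


class Shard:
    """Toy model of the two pieces of state involved in the bug."""

    def __init__(self):
        self.documents = {}     # state machine documents "on disk"
        self.instances = set()  # POS instances that are installed

    def insert_document(self, op_id):
        # Step 1: a second insert for the same id fails.
        if op_id in self.documents:
            raise DuplicateKey(op_id)
        self.documents[op_id] = {"_id": op_id}

    def install_instance(self, op_id):
        # Step 2: assumed to be get-or-create, so repeating it is safe.
        self.instances.add(op_id)


def create_non_idempotent(shard, op_id, interrupt_after_insert=False):
    shard.insert_document(op_id)
    if interrupt_after_insert:
        raise Interrupted()
    shard.install_instance(op_id)


def create_idempotent(shard, op_id, interrupt_after_insert=False):
    try:
        shard.insert_document(op_id)
    except DuplicateKey:
        # The document was written by an earlier, interrupted attempt.
        # Fall through and still try to install the instance.
        pass
    if interrupt_after_insert:
        raise Interrupted()
    shard.install_instance(op_id)


def interrupted_then_retried(create):
    """Interrupt the first attempt between steps 1 and 2, then retry once."""
    shard = Shard()
    try:
        create(shard, "op-1", interrupt_after_insert=True)
    except Interrupted:
        pass
    try:
        create(shard, "op-1")
    except DuplicateKey:
        pass  # the retry failed at step 1 and never reached step 2
    return ("op-1" in shard.documents, "op-1" in shard.instances)


if __name__ == "__main__":
    print("non-idempotent:", interrupted_then_retried(create_non_idempotent))
    print("idempotent:    ", interrupted_then_retried(create_idempotent))
```

In this model, the non-idempotent variant ends with the document present but no instance installed, which matches the hung state described above. The variant that tolerates DuplicateKey ends with both in place.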



 Comments   
Comment by Githook User [ 30/Mar/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-74647 Retry create resharding state machine on interrupt

(cherry picked from commit c6fbd4ae07365389aa544f28e718eecf740604c7)
Branch: v5.0
https://github.com/mongodb/mongo/commit/9df7ed07270f5e3bfe88a1d1566a98f83e41c8f7

Comment by Githook User [ 30/Mar/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-74647 Retry create resharding state machine on interrupt

(cherry picked from commit c6fbd4ae07365389aa544f28e718eecf740604c7)
Branch: v6.0
https://github.com/mongodb/mongo/commit/751788fcf6e2b87432958c50a2cada6aee1643e8

Comment by Githook User [ 29/Mar/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-74647 Retry create resharding state machine on interrupt
Branch: master
https://github.com/mongodb/mongo/commit/c6fbd4ae07365389aa544f28e718eecf740604c7
