[SERVER-61482] Updates to config.reshardingOperations wait for PrimaryOnlyService to be rebuilt while holding oplog slot, stalling replication on config server indefinitely
Created: 15/Nov/21  Updated: 29/Oct/23  Resolved: 17/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.1.0
Fix Version/s: 5.2.0, 5.0.5, 5.1.1

Type: Bug
Priority: Critical - P2
Reporter: Max Hirschhorn
Assignee: Max Hirschhorn
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-49572 Implement onReshardingParticipantTran... Closed
Related
related to SERVER-61483 Resharding coordinator fails to recov... Closed
related to SERVER-61607 Accept DuplicateKey as a possible err... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0
Sprint: Sharding 2021-11-29
Participants:
Linked BF Score: 159
Story Points: 2

 Description   

PrimaryOnlyService::onStepUp() waits for stepUpOpTime to become majority-committed before attempting to rebuild any Instances. For new optimes to become majority-committed, secondaries must be able to read new entries from a forward-scanning oplog cursor, which in turn depends on there being no outstanding storage transactions with oplog slots still reserved (i.e., no hole in the oplog).
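The effect of an oplog hole on a forward-scanning reader can be illustrated with a toy model. This is only a sketch, not server code: ToyOplog, reserve_slot, commit, and readable_through are hypothetical names, and timestamps are simplified to integers. It assumes a reader may only see entries that sit before the earliest still-reserved slot.

```python
class ToyOplog:
    """Toy oplog: integer timestamps; a reserved but uncommitted slot is a hole."""

    def __init__(self):
        self._next_ts = 1
        self._committed = set()
        self._reserved = set()

    def reserve_slot(self):
        """Reserve the next timestamp for a storage transaction that is still open."""
        ts = self._next_ts
        self._next_ts += 1
        self._reserved.add(ts)
        return ts

    def commit(self, ts):
        """Commit the transaction that owns `ts`, which closes its hole."""
        self._reserved.remove(ts)
        self._committed.add(ts)

    def readable_through(self):
        """Highest committed timestamp that has no hole before it (0 if none)."""
        limit = min(self._reserved) if self._reserved else self._next_ts
        return max((ts for ts in self._committed if ts < limit), default=0)


oplog = ToyOplog()
hole = oplog.reserve_slot()       # transaction stays open, leaving a hole
later = oplog.reserve_slot()
oplog.commit(later)               # a later entry commits behind the hole
before = oplog.readable_through() # 0: the reader cannot scan past the hole
oplog.commit(hole)
after = oplog.readable_through()  # 2: both entries are now readable
```

In this model, nothing written behind the hole can be read by a secondary, so nothing behind it can become majority-committed until the open transaction commits or aborts.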

ReshardingOpObserver::onUpdate() attempts to get the ReshardingCoordinator and its associated ReshardingCoordinatorObserver to update their in-memory states. Doing so must wait until the ReshardingCoordinatorService has finished rebuilding. However, ReshardingOpObserver::onUpdate() currently performs this wait with its storage transaction still active and after having acquired an oplog slot for the update to config.reshardingOperations. If the ReshardingCoordinatorService hasn't already been rebuilt by the time the update to the config.reshardingOperations collection arrives from the donor or recipient shard, then it will never finish rebuilding: the rebuild waits for stepUpOpTime to become majority-committed, which cannot happen while the update's reserved oplog slot leaves a hole, and the update does not release that slot until the rebuild finishes. Replication on the config server likewise cannot make progress while the oplog hole is present.
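The circular wait described above can be sketched as a small deterministic simulation. This is a toy model rather than the server's logic: ToyConfigServer, run, and the wait_inside_transaction flag are hypothetical names. The non-blocking variant assumes the wait for the service is deferred until after the storage transaction commits, which is one ordering consistent with the commit message "Avoid resharding holding open an oplog hole on configsvr" and is not a description of the actual patch.

```python
from dataclasses import dataclass


@dataclass
class ToyConfigServer:
    hole_open: bool = False                   # update holds a reserved oplog slot
    update_committed: bool = False            # storage transaction has committed
    step_up_majority_committed: bool = False  # stepUpOpTime is majority-committed
    service_rebuilt: bool = False             # coordinator service finished rebuilding
    observer_notified: bool = False           # in-memory state was updated


def run(wait_inside_transaction: bool, max_rounds: int = 10) -> ToyConfigServer:
    """Step the toy actors until none of them can make progress."""
    s = ToyConfigServer(hole_open=True)  # the update has reserved its oplog slot
    for _ in range(max_rounds):
        progressed = False

        # Update path (stands in for the op observer's onUpdate).
        if not s.update_committed:
            if not wait_inside_transaction:
                # Commit first, so the slot is released before any waiting.
                s.update_committed, s.hole_open = True, False
                progressed = True
            elif s.service_rebuilt:
                # Waiting inside the transaction: needs the rebuilt service.
                s.observer_notified = True
                s.update_committed, s.hole_open = True, False
                progressed = True
        elif not s.observer_notified and s.service_rebuilt:
            s.observer_notified = True
            progressed = True

        # Replication: the commit point advances only when no hole is open.
        if not s.hole_open and not s.step_up_majority_committed:
            s.step_up_majority_committed = True
            progressed = True

        # Step-up path: rebuild only after stepUpOpTime is majority-committed.
        if s.step_up_majority_committed and not s.service_rebuilt:
            s.service_rebuilt = True
            progressed = True

        if not progressed:
            break
    return s


stuck = run(wait_inside_transaction=True)
healthy = run(wait_inside_transaction=False)
```

With wait_inside_transaction=True the model stops immediately with the hole still open and the service never rebuilt. With False, the update commits, the commit point advances, the service rebuilds, and the observer is notified afterwards.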



 Comments   
Comment by Githook User [ 17/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61482 Avoid resharding holding open an oplog hole on configsvr.

(cherry picked from commit 2d3592d80fb851e00b519bfeeea513aaff6d0c95)
Branch: v5.0
https://github.com/mongodb/mongo/commit/963c540bdc58509e4938f2d5981f13c66e012b59

Comment by Githook User [ 17/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61482 Avoid resharding holding open an oplog hole on configsvr.

(cherry picked from commit 2d3592d80fb851e00b519bfeeea513aaff6d0c95)
Branch: v5.1
https://github.com/mongodb/mongo/commit/a96a30f357990094f7e7b464cc143cab05716238

Comment by Githook User [ 16/Nov/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-61482 Avoid resharding holding open an oplog hole on configsvr.
Branch: master
https://github.com/mongodb/mongo/commit/2d3592d80fb851e00b519bfeeea513aaff6d0c95
