[SERVER-67916] Race during stepdown can trigger invariant in ReshardingMetrics Created: 08/Jul/22  Updated: 29/Oct/23  Resolved: 18/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.9, 6.0.0-rc13
Fix Version/s: 5.0.13, 6.0.2, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.0
Steps To Reproduce:

Running test_resharding_test_fixture_shutdown_retry_needed.js reproduces this bug intermittently. Adding a sleep before this line increases the chance of reproducing the issue, but reproduction is still not reliable.

Sprint: Sharding 2022-08-22
Participants:
Linked BF Score: 0
Story Points: 3

 Description   

When a step down occurs, the recipient tries to wait for the _dataReplicationQuiesced future before deactivating the metrics state. That future is a composite of multiple futures, and when any of them errors out it would normally try to join the others here. The issue is that during step down the primary-only service also shuts down the executor, so the executor refuses to run the onError continuation and the future chain finishes without waiting for the other futures to complete.
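
The race described above can be illustrated with a minimal, single-threaded sketch. It does not use the server's futures or executor API: InlineExecutor, Worker, and quiesceAfterError are hypothetical names invented for this illustration, and "joining" a worker is modeled as simply marking it finished.

```cpp
#include <cassert>
#include <functional>
#include <vector>

namespace stepdown_race_sketch {

// Hypothetical stand-in for a task executor: runs work inline until it is
// shut down, after which it refuses new work.
class InlineExecutor {
public:
    void shutdown() { _shutDown = true; }

    // Returns false if the task was rejected because the executor is shut down.
    bool schedule(const std::function<void()>& task) {
        if (_shutDown) {
            return false;
        }
        task();
        return true;
    }

private:
    bool _shutDown = false;
};

// Hypothetical stand-in for one of the futures that make up the composite.
struct Worker {
    bool finished = false;
};

// Models the composite future after one worker has errored out. The error
// continuation is supposed to join the remaining workers before the chain
// completes. Returns how many workers are still unfinished at that point.
int quiesceAfterError(InlineExecutor& executor, std::vector<Worker>& workers) {
    assert(!workers.empty());
    workers.front().finished = true;  // the worker that errored out is done

    // The "onError" continuation: join (here: finish) every other worker.
    // A rejected continuation is deliberately ignored, which mirrors the
    // behavior described in this ticket: the chain completes regardless.
    executor.schedule([&workers] {
        for (auto& worker : workers) {
            worker.finished = true;
        }
    });

    int unfinished = 0;
    for (const auto& worker : workers) {
        if (!worker.finished) {
            ++unfinished;
        }
    }
    return unfinished;
}

}  // namespace stepdown_race_sketch
```

With a live executor the function returns 0. If shutdown() is called on the executor first, the continuation is dropped and the other workers are still unfinished when the chain completes, which is the window in which the metrics state could be deactivated too early.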



 Comments   
Comment by Githook User [ 14/Sep/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-67916 Fix semantics of cancelWhenAnyErrorThenQuiesce

(cherry picked from commit 8ea624563847736c94f0e500d3097557ab4d8315)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d711d9362842c971f36c4b14ab75488dc345700a

Comment by Githook User [ 13/Sep/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-67916 Fix semantics of cancelWhenAnyErrorThenQuiesce

(cherry picked from commit 8ea624563847736c94f0e500d3097557ab4d8315)
Branch: v6.0
https://github.com/mongodb/mongo/commit/fdf878f2d224ac2786fff8bafe003c2ef4cb5b1b

Comment by Max Hirschhorn [ 01/Sep/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-67916 Fix semantics of cancelWhenAnyErrorThenQuiesce
Branch: master
https://github.com/mongodb/mongo/commit/8ea624563847736c94f0e500d3097557ab4d8315

Comment by Max Hirschhorn [ 25/Jul/22 ]

The implementation plan is to pass the cleanupExecutor into resharding::cancelWhenAnyErrorThenQuiesce() as the executor on which the continuations run. The cleanup executor for primary-only service Instances isn't shut down until all of the Instances have returned a ready future from their run() method, so it should still be available to run the onError() continuation in resharding::cancelWhenAnyErrorThenQuiesce().
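
The lifetime argument in this plan can be sketched with a small single-threaded model. This is an illustration under stated assumptions, not the actual change: InlineExecutor, ServiceModel, and quiesceDuringStepDown are hypothetical names, and the real signature of resharding::cancelWhenAnyErrorThenQuiesce() is not reproduced here.

```cpp
#include <cassert>
#include <functional>

namespace cleanup_executor_sketch {

// Hypothetical stand-in for a task executor that refuses work after shutdown.
class InlineExecutor {
public:
    void shutdown() { _shutDown = true; }

    // Returns false if the task was rejected because the executor is shut down.
    bool schedule(const std::function<void()>& task) {
        if (_shutDown) {
            return false;
        }
        task();
        return true;
    }

private:
    bool _shutDown = false;
};

// Hypothetical model of the two executors of a primary-only service.
struct ServiceModel {
    InlineExecutor scopedExecutor;   // shut down as soon as step down begins
    InlineExecutor cleanupExecutor;  // shut down only after every run() is ready

    void stepDown() { scopedExecutor.shutdown(); }
    void allInstancesFinished() { cleanupExecutor.shutdown(); }
};

// Returns how many of workerCount workers are still unfinished when the
// quiesce chain completes during a step down.
int quiesceDuringStepDown(ServiceModel& service, int workerCount, bool useCleanupExecutor) {
    assert(workerCount > 0);
    int unfinished = workerCount - 1;  // one worker has already errored out
    service.stepDown();

    InlineExecutor& continuationExecutor =
        useCleanupExecutor ? service.cleanupExecutor : service.scopedExecutor;

    // The "onError" continuation joins the remaining workers. It only runs if
    // the chosen executor is still accepting work.
    continuationExecutor.schedule([&unfinished] { unfinished = 0; });

    // Only now has run() produced a ready future, so only now may the cleanup
    // executor stop.
    service.allInstancesFinished();
    return unfinished;
}

}  // namespace cleanup_executor_sketch
```

In this model, scheduling the continuation on the scoped executor leaves workers unfinished after a step down, while scheduling it on the cleanup executor joins all of them, because that executor outlives the run() future.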

Comment by Randolph Tan [ 13/Jul/22 ]

Note: shutdown is also called on the primary-only service executor from the repl coordinator shutdown.

Comment by Randolph Tan [ 08/Jul/22 ]

Note: This is no longer a problem in the latest code due to the refactoring work done on resharding metrics.
