[SERVER-67916] Race during stepdown can trigger invariant in ReshardingMetrics Created: 08/Jul/22 Updated: 29/Oct/23 Resolved: 18/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.9, 6.0.0-rc13 |
| Fix Version/s: | 5.0.13, 6.0.2, 6.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Brett Nawrocki |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v6.0, v5.0 | ||||||||
| Steps To Reproduce: | Running test_resharding_test_fixture_shutdown_retry_needed.js reproduces this bug intermittently. Adding a sleep before this line increases the chance of reproducing the issue, but the reproduction is still not reliable. | ||||||||
| Sprint: | Sharding 2022-08-22 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Story Points: | 3 | ||||||||
| Description |
|
When a step down occurs, the recipient tries to wait for the _dataReplicationQuiesced future before deactivating the metrics state. That future is a composite of multiple futures, and when any of them errors out it would normally join the others here. The issue is that during step down the primary-only service also shuts down the executor, so the executor refuses to run the onError continuation and the future chain finishes without waiting for the other futures to complete. |
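The race described above can be illustrated with a toy Python model rather than the server's C++ code. This is only a sketch under stated assumptions: Python's `concurrent.futures.ThreadPoolExecutor` stands in for the primary-only service executor (its `submit()` raises `RuntimeError` once `shutdown()` has been called), and `quiesce_after_error`, `failed`, and `slow` are hypothetical names, not identifiers from the resharding code.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def quiesce_after_error(failed, others, executor):
    """Toy composite future (hypothetical): after `failed` has errored,
    schedule an error handler on `executor` that joins `others` first."""
    result = Future()

    def on_error():
        for f in others:
            f.exception()  # blocks until f has completed (the "join")
        result.set_exception(failed.exception())

    try:
        executor.submit(on_error)
    except RuntimeError:
        # The executor is already shut down and refuses the handler, so the
        # chain resolves right away without joining the other futures.
        result.set_exception(failed.exception())
    return result


release = threading.Event()
workers = ThreadPoolExecutor(max_workers=1)
slow = workers.submit(release.wait)  # a task that is still in flight

failed = Future()
failed.set_exception(RuntimeError("step down"))

service_executor = ThreadPoolExecutor(max_workers=1)
service_executor.shutdown()  # step down shuts the executor down first

quiesced = quiesce_after_error(failed, [slow], service_executor)
quiesced_done_early = quiesced.done()  # chain has already resolved
slow_still_running = not slow.done()   # but the other task has not finished

release.set()       # let the slow task finish so the pool can exit
workers.shutdown()
```

In this model the composite reports completion while `slow` is still running, which mirrors the state the recipient ends up in when it proceeds to deactivate the metrics state.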
| Comments |
| Comment by Githook User [ 14/Sep/22 ] |
|
Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki) Message: (cherry picked from commit 8ea624563847736c94f0e500d3097557ab4d8315) |
| Comment by Githook User [ 13/Sep/22 ] |
|
Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki) Message: (cherry picked from commit 8ea624563847736c94f0e500d3097557ab4d8315) |
| Comment by Max Hirschhorn [ 01/Sep/22 ] |
|
Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki) Message: |
| Comment by Max Hirschhorn [ 25/Jul/22 ] |
|
The implementation plan is to pass the cleanupExecutor into resharding::cancelWhenAnyErrorThenQuiesce() as the executor on which its continuations run. The cleanup executor for primary-only service Instances isn't shut down until all of the Instances have returned a ready future from their run() method, so it should still be available to run the onError() continuation in resharding::cancelWhenAnyErrorThenQuiesce(). |
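The effect of the plan above can be sketched with a toy Python model; it is not the actual change to `resharding::cancelWhenAnyErrorThenQuiesce()`. The sketch assumes a `ThreadPoolExecutor` as a stand-in for the cleanup executor, and `quiesce_after_error`, `failed`, and `slow` are hypothetical names chosen for illustration.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def quiesce_after_error(failed, others, cleanup_executor):
    """Toy composite future (hypothetical): the error handler runs on the
    cleanup executor, which is assumed to outlive the service executor."""
    result = Future()

    def on_error():
        for f in others:
            f.exception()  # join: blocks until f has completed
        result.set_exception(failed.exception())

    cleanup_executor.submit(on_error)
    return result


release = threading.Event()
workers = ThreadPoolExecutor(max_workers=1)
slow = workers.submit(release.wait)  # a task that is still in flight

failed = Future()
failed.set_exception(RuntimeError("step down"))

# The regular service executor may already be shut down at this point;
# the handler no longer depends on it.
cleanup_executor = ThreadPoolExecutor(max_workers=1)

quiesced = quiesce_after_error(failed, [slow], cleanup_executor)
done_before_release = quiesced.done()  # handler is blocked joining `slow`

release.set()                          # the slow task now finishes
error = quiesced.exception(timeout=5)  # the chain resolves with the error
slow_done_when_resolved = slow.done()

workers.shutdown()
cleanup_executor.shutdown()
```

Because the handler is accepted by an executor that is still running, the composite in this model resolves only after every other future has completed.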
| Comment by Randolph Tan [ 13/Jul/22 ] |
|
Note: shutdown is also called on the primary-only service executor from repl coordinator shutdown. |
| Comment by Randolph Tan [ 08/Jul/22 ] |
|
Note: This is no longer a problem in the latest code because of the refactoring work done on resharding metrics. |