[SERVER-73916] Improve ReshardingTest fixture error reporting when reshardCollection has already failed before any failpoints are waited on Created: 11/Feb/23 Updated: 29/Oct/23 Resolved: 14/Feb/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.15, 7.0.0-rc0, 6.0.5 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Sharding NYC
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Backport Requested: |
v6.0, v5.0
|
||||||||||||||||
| Sprint: | Sharding NYC 2023-02-20 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 49 | ||||||||||||||||
| Description |
|
To avoid hanging or crashing the mongo shell process, the ReshardingTest fixture goes to some lengths to interrupt the reshardCollection command on mongos and join the background thread in the mongo shell that was running the reshardCollection command. However, after the changes from 0d5fd57 as part of |
| Comments |
| Comment by Githook User [ 15/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'} Message: Enables the ReshardingTest fixture to use the (cherry picked from commit 75ff482b93060c1c8a28f418dc55d16c4fcc02b6) | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 15/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'} Message: Enables the ReshardingTest fixture to use the (cherry picked from commit 75ff482b93060c1c8a28f418dc55d16c4fcc02b6) | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 14/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'} Message: Enables the ReshardingTest fixture to use the | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Abdul Qadeer [ 14/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
Yes it does! Thank you for clarifying. | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Max Hirschhorn [ 13/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
The waitForFailPoint command permits waiting for failpoints that are currently disabled. It won't return a "failpoint not found" error. For example, the following additional calls to the waitForFailPoint command all succeed because the timesEntered check is based on when the failpoint was first turned on and not on the last observed value for the number of times the failpoint has been entered so far. Does this cover the case you were imagining?
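
A minimal in-memory model of the behavior described in this comment, written in plain JavaScript. It is only a sketch: `ModelFailPoint` and its methods are hypothetical names, not the server's implementation and not the calls this comment refers to. It assumes the wait compares the failpoint's entered counter against an absolute target derived from the count at the time the failpoint was turned on, so repeated waits keep succeeding after the failpoint is disabled.

```javascript
// Hypothetical model of a failpoint; illustrative only.
class ModelFailPoint {
    constructor() {
        this.enabled = false;
        this.timesEntered = 0;
    }

    // Turning the failpoint on returns the counter value at that moment.
    // Later waits use this value as their baseline.
    turnOn() {
        this.enabled = true;
        return this.timesEntered;
    }

    turnOff() {
        this.enabled = false;
    }

    // Called by the code path guarded by the failpoint.
    // Assumption: the counter only advances while the failpoint is enabled.
    hit() {
        if (this.enabled) {
            this.timesEntered++;
        }
    }

    // Non-blocking stand-in for a wait: satisfied once the counter has reached
    // the absolute target, whether or not the failpoint is still enabled.
    waitSatisfied(targetTimesEntered) {
        return this.timesEntered >= targetTimesEntered;
    }
}

const fp = new ModelFailPoint();
const baseline = fp.turnOn();  // baseline captured when first turned on
fp.hit();                      // the guarded code reaches the failpoint once
fp.turnOff();                  // failpoint is now disabled

// Both waits use the same baseline, so both are satisfied after turnOff().
const results = [
    fp.waitSatisfied(baseline + 1),
    fp.waitSatisfied(baseline + 1),
];
console.log(results);
```

Because the target is computed from the baseline rather than from the most recently observed count, a second wait issued in a retry asks the same question as the first and gets the same answer.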
| |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Abdul Qadeer [ 13/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
The idea looks good to me, but in the patch it is possible that completionFailpoint.waitWithTimeout(1000) is called twice (_commandDoneSignal.getCount() is still 1 in the retry of the assert and fp.waitWithTimeout(1000) returns false), and the second invocation may fail with a "failpoint not found" error because we already turned the failpoint off the first time. | |||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Max Hirschhorn [ 11/Feb/23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||
|
One idea would be to check whether the ReshardingCoordinator is blocked on the reshardingPauseCoordinatorBeforeCompletion failpoint and disable it to allow the _commandDoneSignal to be decremented. What do you think abdul.qadeer@mongodb.com?
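
The idea above can be sketched with a small in-memory model in plain JavaScript. Everything here is a hypothetical stand-in (`ModelLatch`, `ModelCoordinator`, `commandFinishedEarly`); it is not the ReshardingTest fixture's code, and it assumes that releasing the completion failpoint is what lets the already-failed command return and decrement the latch.

```javascript
// Hypothetical stand-in for a count-down latch such as _commandDoneSignal.
class ModelLatch {
    constructor(count) {
        this.count = count;
    }
    countDown() {
        if (this.count > 0) {
            this.count--;
        }
    }
    getCount() {
        return this.count;
    }
}

// Hypothetical stand-in for a coordinator parked on the completion failpoint.
class ModelCoordinator {
    constructor(commandDoneSignal) {
        this.pausedBeforeCompletion = true;
        this.commandDoneSignal = commandDoneSignal;
    }

    // Releasing the failpoint lets the command return, which counts the latch
    // down. Calling this a second time is a no-op.
    releaseCompletionFailpoint() {
        if (this.pausedBeforeCompletion) {
            this.pausedBeforeCompletion = false;
            this.commandDoneSignal.countDown();
        }
    }
}

// Returns true when the command has finished, so the caller can report the
// command's error instead of waiting on failpoints that will never be reached.
function commandFinishedEarly(coordinator, latch) {
    if (latch.getCount() === 0) {
        return true;
    }
    if (coordinator.pausedBeforeCompletion) {
        coordinator.releaseCompletionFailpoint();
    }
    return latch.getCount() === 0;
}

const latch = new ModelLatch(1);
const coordinator = new ModelCoordinator(latch);
const first = commandFinishedEarly(coordinator, latch);
const second = commandFinishedEarly(coordinator, latch);  // retry is harmless
console.log(first, second, latch.getCount());
```

Keeping the release idempotent means a retried check does not depend on the failpoint still being enabled.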
|