[SERVER-36739] Use the mongos_manual_intervention_action hook in concurrency stepdown suites Created: 17/Aug/18 Updated: 29/Oct/23 Resolved: 24/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0, 4.4.2, 4.2.11, 4.0.22 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Janna Golden | Assignee: | Misha Tyulenev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Backport Requested: | v4.4, v4.2, v4.0 |
||||||||
| Sprint: | Sharding 2020-04-20, Sharding 2020-05-04 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 8 | ||||||||
| Description |
|
Tests can fail in both concurrency_sharded_with_stepdowns and concurrency_sharded_with_stepdowns_and_balancer because of a failed shardCollection leaving partially written chunks. We should use the mongos_manual_intervention_action hook in both of these suites to catch and clean up this error. |
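The catch-and-clean-up behaviour requested above can be sketched roughly as follows. This is a minimal illustration in Python against a toy in-memory chunk table, not the actual hook. The names `ManualInterventionRequired`, `shard_collection_with_cleanup`, `fake_shard_collection`, and `fake_cleanup` are hypothetical and introduced only for this example.

```python
class ManualInterventionRequired(Exception):
    """Hypothetical stand-in for the error raised when partially written chunks exist."""


def shard_collection_with_cleanup(shard_collection, cleanup, namespace, max_attempts=2):
    """Call shard_collection(namespace); on ManualInterventionRequired, clean up and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return shard_collection(namespace)
        except ManualInterventionRequired:
            if attempt == max_attempts:
                raise  # cleanup did not help; surface the error to the test
            cleanup(namespace)


# Toy stand-in for config.chunks, pre-populated as if an earlier attempt had failed midway.
NS = "test23_fsmdb0.fsmcoll0"
chunks = {NS: ["chunk-0"]}


def fake_shard_collection(ns):
    # Refuse to proceed while leftover chunks from a previous attempt exist.
    if chunks.get(ns):
        raise ManualInterventionRequired(f"A previous attempt to shard collection {ns} failed")
    chunks[ns] = ["chunk-0", "chunk-1"]
    return {"ok": 1, "collectionsharded": ns}


def fake_cleanup(ns):
    # Remove the partially written chunks so the command can start from scratch.
    chunks.pop(ns, None)


result = shard_collection_with_cleanup(fake_shard_collection, fake_cleanup, NS)
```

The wrapper retries only after the cleanup step has run, so an error that persists after cleanup still fails the test rather than being silently swallowed.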
| Comments |
| Comment by Githook User [ 30/Oct/20 ] | ||
|
Author: Misha Tyulenev <misha@mongodb.com> (mikety) Message: (cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a) | ||
| Comment by Githook User [ 30/Oct/20 ] | ||
|
Author: Misha Tyulenev <misha@mongodb.com> (mikety) Message: (cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a) | ||
| Comment by Githook User [ 30/Oct/20 ] | ||
|
Author: Misha Tyulenev <misha@mongodb.com> (mikety) Message: (cherry picked from commit c4873acda56712bba29f1ce1f81c6e8dc873669a) | ||
| Comment by Githook User [ 24/Apr/20 ] | ||
|
Author: Misha Tyulenev <misha@mongodb.com> (mikety) Message: | ||
| Comment by Janna Golden [ 22/Aug/18 ] | ||
|
max.hirschhorn, actually the command is sent with `RetryPolicy::kIdempotent`, so we already retry. | ||
| Comment by Max Hirschhorn [ 22/Aug/18 ] | ||
I thought we had gotten further in running the test than we did because the error message says "A previous attempt to shard collection test23_fsmdb0.fsmcoll0" (emphasis mine) but the concurrency framework would only attempt to run the shardCollection command once for a particular namespace.
I've included the relevant log messages below. Is it not possible to retry the _shardsvrShardCollection command on NotMaster error responses?
It is somewhat obnoxious that we jump through hoops in our testing infrastructure to ensure after each test in the stepdown suites that mongos knows the current primary of the CSRS and all the replica set shards, but the CSRS has exactly the same problem. | ||
| Comment by Janna Golden [ 22/Aug/18 ] | ||
|
max.hirschhorn - the shardCollection command doesn't return success, so it's retried. In the particular case of the linked BF, the primary shard (shard 0) creates chunks on itself and then sends cloneCollectionOptionsFromPrimaryShard() to shard 1. That request fails with a "NotMaster" error, which fails the shardCollection command and leaves partially written chunks. | ||
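The sequence described in this comment can be modelled with a small simulation. It is a rough sketch under stated assumptions, not server code: `ToyCluster`, `NotMaster`, `PartialChunksError`, and `run_with_retry` are hypothetical names, and the toy only mimics the ordering "write chunks first, then contact shard 1".

```python
class NotMaster(Exception):
    """Hypothetical stand-in for a NotMaster response from a stepped-down node."""


class PartialChunksError(Exception):
    """Hypothetical stand-in for the 'previous attempt to shard collection' error."""


class ToyCluster:
    def __init__(self, clone_failures):
        self.chunks = {}                  # toy stand-in for config.chunks
        self.clone_failures = clone_failures
        self.attempts = 0

    def shard_collection(self, ns):
        self.attempts += 1
        if ns in self.chunks:
            # Leftovers from an earlier attempt block any further progress.
            raise PartialChunksError(f"A previous attempt to shard collection {ns} failed")
        self.chunks[ns] = ["chunk-0"]     # step 1: primary shard writes chunks
        if self.clone_failures > 0:       # step 2: request to shard 1 hits a stepdown
            self.clone_failures -= 1
            raise NotMaster("shard 1 stepped down")
        return {"ok": 1}


def run_with_retry(cluster, ns, max_attempts=3):
    """Retry only on NotMaster, as an idempotent retry policy would."""
    for _ in range(max_attempts):
        try:
            return cluster.shard_collection(ns)
        except NotMaster:
            continue
    raise NotMaster("retries exhausted")


NS = "test23_fsmdb0.fsmcoll0"
cluster = ToyCluster(clone_failures=1)
outcome = None
try:
    run_with_retry(cluster, NS)
except PartialChunksError as exc:
    outcome = str(exc)
```

In this model the first attempt fails with `NotMaster` after the chunks are already written, and the automatic retry then fails on the leftover chunks. This matches the "previous attempt" wording in the error even though the test issued the command only once.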
| Comment by Max Hirschhorn [ 22/Aug/18 ] | ||
|
janna.golden, the concurrency framework runs the shardCollection command in the main thread before permitting stepdowns. Why would the _shardsvrShardCollection command be getting run multiple times after the shardCollection command returns success? |