[SERVER-40713] Enable fsm workloads that use moveChunk in sharded stepdown suites Created: 18/Apr/19 Updated: 06/Dec/22 Resolved: 12/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | max-triage, pm-564 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Sprint: | Sharding 2019-06-17, Sharding 2019-07-15, Sharding 2019-08-26 |
| Participants: | |
| Description |
|
There are a few concurrency workloads that explicitly use moveChunk, which are currently disallowed in the concurrency stepdown suites because the network error retry override considers moveChunk a non-retryable command. Conceptually, moving a chunk is retryable; it was likely disallowed because moveChunk can return non-retryable error codes by default when interrupted (e.g. OperationFailed if persisting the critical section signal fails). To get more coverage of stepdowns concurrent with moveChunk, it should be possible to add special logic to the network override to handle the particular errors moveChunk returns, similar to the workarounds for other operations that return inconsistent codes. This would be especially valuable for the workloads that move chunks while running transactions, like random_moveChunk_broadcast_update_transaction.js and agg_with_chunk_migrations.js, when running in the concurrency_sharded_multi_stmt_txn_with_stepdowns suite. |
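The special-casing described above could look something like the following. This is a minimal sketch only, not the actual override logic in the test libraries; the helper name and the set of retryable code names are illustrative assumptions. The idea is to recognize a moveChunk response whose reported code is a generic non-retryable one (OperationFailed) but whose error message buries a genuinely retryable cause.

```javascript
// Hypothetical helper (name and code list are assumptions, not the real
// override): decide whether a failed moveChunk response should be retried.
const kRetryableCodeNames = [
    "InterruptedDueToStepdown",
    "InterruptedDueToReplStateChange",
    "NotWritablePrimary",
    "ShutdownInProgress",
];

function isRetryableMoveChunkResponse(res) {
    if (res.ok === 1) {
        return false;  // The command succeeded; nothing to retry.
    }
    // moveChunk may report OperationFailed while the real retryable cause
    // appears only in the error message text.
    if (res.codeName === "OperationFailed" && typeof res.errmsg === "string") {
        return kRetryableCodeNames.some(name => res.errmsg.includes(name));
    }
    // Otherwise fall back to the ordinary retryable-code check.
    return kRetryableCodeNames.includes(res.codeName);
}
```

A network override would call a check like this before deciding to re-issue the command, mirroring the existing workarounds for other operations that return inconsistent codes.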
| Comments |
| Comment by Max Hirschhorn [ 12/Nov/21 ] |
|
The moveChunk FSM workloads were enabled in the stepdown suites as part of |
| Comment by Blake Oler [ 04/Sep/19 ] |
|
Deprioritizing this ticket, as Alex's ticket to implement a random balancer policy was pushed and will satisfy this constraint for the time being. |
| Comment by Jack Mulrow [ 19/Apr/19 ] |
|
kaloian.manassiev, yeah, the failures are caused by interruptions from stepdowns, not by transactions. They typically manifest as responses with OperationFailed and an error message with a retryable error buried inside, like InterruptedDueToStepdown. I put this in the sharded transactions epic because, while working on the transaction stepdown suites, I realized we could do this to get better concurrency coverage of transactions with stepdowns and migrations, but this isn't a new problem. |
| Comment by Kaloian Manassiev [ 18/Apr/19 ] |
|
jack.mulrow, the reason for the moveChunk failures in these tests is not tied exclusively to transactions, right? It can fail for numerous other reasons even when there are no transactions, correct? How do these moveChunk failures manifest themselves in the test? |