[SERVER-40713] Enable fsm workloads that use moveChunk in sharded stepdown suites Created: 18/Apr/19  Updated: 06/Dec/22  Resolved: 12/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Jack Mulrow Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: max-triage, pm-564
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates SERVER-42192 Write a concurrency workload to test ... Closed
Related
is related to SERVER-42914 Implement random chunk selection poli... Closed
Assigned Teams:
Sharding
Sprint: Sharding 2019-06-17, Sharding 2019-07-15, Sharding 2019-08-26
Participants:

 Description   

There are a few concurrency workloads that explicitly use moveChunk, which are currently disallowed in the concurrency stepdown suites because moveChunk is considered a non-retryable command by the network error retry override. Conceptually, moving a chunk is retryable, but it was likely disallowed because, if interrupted, it can return error codes that are not retryable by default (e.g. OperationFailed if persisting the critical section signal fails).

To get more coverage of stepdowns concurrent with moveChunks, it should be possible to add special logic to the network override to handle the particular errors returned by moveChunk instead, similar to the workarounds for other operations that return inconsistent codes. This would be especially valuable for the workloads that move chunks while running transactions, like random_moveChunk_broadcast_update_transaction.js, and agg_with_chunk_migrations.js when running in the concurrency_sharded_multi_stmt_txn_with_stepdowns suite.
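
A rough sketch of what such special handling might look like, not the actual network error override code. It assumes the failures surface as an OperationFailed response whose error message embeds the name of a retryable error. The helper names (isRetryableMoveChunkResponse, runMoveChunkWithRetry), the list of embedded error names, and the attempt limit are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical list of retryable error names that may be buried in the
// errmsg of a failed moveChunk. This list is an assumption, not taken
// from the real override.
const kEmbeddedRetryableNames = [
    "InterruptedDueToStepDown",
    "PrimarySteppedDown",
];

// Returns true if a failed moveChunk response looks like a stepdown
// interruption wrapped in OperationFailed. Matching is case-insensitive
// because the embedded name is only free text inside errmsg.
function isRetryableMoveChunkResponse(res) {
    if (res.ok === 1) {
        return false;
    }
    if (res.codeName !== "OperationFailed") {
        return false;
    }
    const errmsg = (res.errmsg || "").toLowerCase();
    return kEmbeddedRetryableNames.some(
        (name) => errmsg.includes(name.toLowerCase()));
}

// Runs moveChunk through the supplied runCommand function (hypothetical:
// any function that takes a command object and returns the response),
// retrying only when the failure matches the pattern above.
function runMoveChunkWithRetry(runCommand, cmdObj, maxAttempts = 3) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        res = runCommand(cmdObj);
        if (res.ok === 1 || !isRetryableMoveChunkResponse(res)) {
            return res;
        }
    }
    // Out of attempts: hand the last failure back to the caller.
    return res;
}
```

Matching on the error message is brittle, so a real change would more likely have moveChunk return a consistent retryable code. The sketch only shows the shape of a workaround on the test side.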



 Comments   
Comment by Max Hirschhorn [ 12/Nov/21 ]

The moveChunk FSM workloads were enabled in the stepdown suites as part of SERVER-42192.

Comment by Blake Oler [ 04/Sep/19 ]

Deprioritizing this ticket, as Alex's ticket to implement a random balancer policy was pushed and will satisfy this constraint for the time being.

SERVER-42914

Comment by Jack Mulrow [ 19/Apr/19 ]

kaloian.manassiev, yeah the failures are because of interruptions from stepdowns, not transactions. They typically manifest as responses with OperationFailed and an error message with some retryable error buried inside like InterruptedDueToStepdown. I put this in the sharded transactions epic because while working on the transaction stepdown suites I realized we can do this to have better concurrency coverage of transactions with stepdowns and migrations, but this isn't a new problem.

Comment by Kaloian Manassiev [ 18/Apr/19 ]

jack.mulrow, the reason for the moveChunk failures in these tests is not tied exclusively to transactions, right? It can fail for numerous other reasons even if there are no transactions, correct?

How do these moveChunk failures manifest themselves in the test?

Generated at Thu Feb 08 04:55:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.