[SERVER-34770] Retry on JavaScript execution interruptions in stepdown suites Created: 01/May/18 Updated: 29/Oct/23 Resolved: 14/Sep/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.10, 4.0.5, 4.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Sharding 2018-10-08 |
| Participants: | |
| Linked BF Score: | 20 |
| Description |
|
When a primary steps down, it interrupts all ongoing user operations. If a command is currently executing JavaScript (like an eval or find with $where), its MozJS scope will be marked as killed and return ErrorCodes::Interrupted. This is expected / desired behavior for a stepdown, so the stepdown passthroughs should be resilient to this error code / message combination. The auto_retry_on_network_error.js override should be updated to retry on this response. This is separate from |
| Comments |
| Comment by Githook User [ 20/Nov/18 ] |
|
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'} Message: (cherry picked from commit bb2de3700ee5b8eec9aa51cdbd2ecec937480c6c) |
| Comment by Githook User [ 12/Nov/18 ] |
|
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'} Message: (cherry picked from commit bb2de3700ee5b8eec9aa51cdbd2ecec937480c6c) |
| Comment by Jack Mulrow [ 02/Nov/18 ] |
|
The BF this ticket fixed can occur on 3.6 and 4.0 as well, so requesting a backport to those branches. |
| Comment by Jack Mulrow [ 14/Sep/18 ] |
|
Now that the JS scope returns the error code it was interrupted with, if a stepdown interrupts JS execution, the shell should choose to retry using the existing logic that retries on stepdown errors. So I think this ticket can be resolved. |
| Comment by Benjamin Caimano (Inactive) [ 17/Aug/18 ] |
|
Handing over to sharding. |
| Comment by Githook User [ 09/Jun/18 ] |
|
Author: {'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}Message: |
| Comment by Benjamin Caimano (Inactive) [ 24/May/18 ] |
|
At first glance, I believe that the repl coordinator actually does mark it as ErrorCodes::InterruptedDueToReplStateChange. However, there is a (somewhat reasonable) short circuit in the MozJS scope that just returns ErrorCodes::Interrupted instead of the true code for all manner of kills. jack.mulrow, did you have a favorite repro strategy for the BF? |
| Comment by Kaloian Manassiev [ 11/May/18 ] |
|
We intentionally do not retry on ErrorCodes::Interrupted, because it is supposed to mean that a user killed the task. It seems to me that on stepdown interruption, the MozJS scope should be marked as InterruptedDueToReplSetStepdown (which will be retried). Unless I am mistaken, we should pass this on to the Platforms team. Not sure whether they should do that or Repl. |
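
The retry policy discussed in this thread can be sketched roughly as follows. This is a minimal illustration only, not the actual contents of auto_retry_on_network_error.js: the names `kRetryableCodeNames`, `shouldRetry`, and `runWithRetries` are hypothetical, the set of retryable code names is an assumption, and the sketch assumes a failed response carries a `codeName` field.

```javascript
// Hypothetical sketch; not the real override code.
// Code names treated as stepdown-related here are an assumption for illustration.
const kRetryableCodeNames = new Set([
    "InterruptedDueToReplStateChange",
    "NotMaster",
    "PrimarySteppedDown",
]);

// Returns true if a command response looks like a stepdown error worth retrying.
function shouldRetry(res) {
    if (res.ok === 1) {
        return false;  // Success: nothing to retry.
    }
    // A plain "Interrupted" is deliberately not in the set, because it may
    // mean that a user killed the operation.
    return kRetryableCodeNames.has(res.codeName);
}

// Calls `runCommand` (a hypothetical callback returning a response object)
// until it succeeds, fails with a non-retryable error, or attempts run out.
function runWithRetries(runCommand, maxAttempts) {
    let res;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        res = runCommand();
        if (!shouldRetry(res)) {
            break;
        }
    }
    return res;
}
```

Under this sketch, once the JS scope reports the code it was actually interrupted with, a stepdown during JavaScript execution falls into the retryable set, while a user-initiated kill still surfaces as a plain Interrupted error and is returned to the caller.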