[SERVER-32593] CSRS stepdown during migration commit can trigger fassert on source shard primary Created: 08/Jan/18 Updated: 30/Oct/23 Resolved: 30/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.3, 3.7.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v3.6
|
||||||||
| Sprint: | Sharding 2018-01-15, Sharding 2018-01-29, Sharding 2018-02-12 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Description |
|
During the critical section, the source shard sends _configsvrCommitChunkMigration to the CSRS primary, but this can fail if the primary recently stepped down, causing the source shard to try to log "moveChunk.validating" on the CSRS primary to update its optime before refreshing metadata, and if this also fails, the source shard will fassert. From this comment, it seems that this is desired behavior, but it's a problem for the continuous stepdowns concurrency suite with the balancer enabled, since background migrations can crash servers and fail the test when the cluster is torn down. |
| Comments |
| Comment by Githook User [ 01/Feb/18 ] |
|
Author: {'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'}Message: (cherry picked from commit 153610cb7439546ef8897a2f4eda05b7fb50af5c) |
| Comment by Githook User [ 30/Jan/18 ] |
|
Author: {'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'}Message: |
| Comment by Kaloian Manassiev [ 09/Jan/18 ] |
|
It is already quasi-exception safe because the main state machine methods use a ScopedGuard to invoke the cleanup code. However, it is the cleanup code which is not exception-safe. So I think if we make that code not throw (by catching the exceptions), it should be sufficient. |
| Comment by Andy Schwerin [ 09/Jan/18 ] |
|
How hard would it be to make the source manager exception safe? |
| Comment by Kaloian Manassiev [ 09/Jan/18 ] |
|
Looks like it is not safe for code to throw in any of the MigrationSourceManager's cleanup methods. I think it is safe to ignore the stepdown errors, because we won't clear the migration flag as well, but we just need to make it not crash in that situation. |