[SERVER-32593] CSRS stepdown during migration commit can trigger fassert on source shard primary Created: 08/Jan/18  Updated: 30/Oct/23  Resolved: 30/Jan/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.6.3, 3.7.2

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6
Sprint: Sharding 2018-01-15, Sharding 2018-01-29, Sharding 2018-02-12
Participants:
Linked BF Score: 0

 Description   

During the critical section, the source shard sends _configsvrCommitChunkMigration to the CSRS primary, but this can fail if the primary recently stepped down, causing the source shard to try to log "moveChunk.validating" on the CSRS primary to update its optime before refreshing metadata, and if this also fails, the source shard will fassert.

From this comment, it seems that this is desired behavior, but it's a problem for the continuous stepdowns concurrency suite with the balancer enabled, since background migrations can crash servers and fail the test when the cluster is torn down.

Example failure: https://evergreen.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_62_64_bit_concurrency_sharded_with_stepdowns_and_balancer_WT_patch_b8f64cc3fde6d041f3e90b1cb2e153b0b15f6c47_5a20e4efe3c33173de00c68d_17_12_01_05_13_55



 Comments   
Comment by Githook User [ 01/Feb/18 ]

Author:

{'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'}

Message: SERVER-32593 Stepdown after failed moveChunk commit shouldn't crash source shard primary

(cherry picked from commit 153610cb7439546ef8897a2f4eda05b7fb50af5c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/0f9d830a6228a82db66661f8869568e7db1105b7

Comment by Githook User [ 30/Jan/18 ]

Author:

{'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow', 'username': 'jsmulrow'}

Message: SERVER-32593 Stepdown after failed moveChunk commit shouldn't crash source shard primary
Branch: master
https://github.com/mongodb/mongo/commit/153610cb7439546ef8897a2f4eda05b7fb50af5c

Comment by Kaloian Manassiev [ 09/Jan/18 ]

It is already quasi-exception safe because the main state machine methods use a ScopedGuard to invoke the cleanup code. However, it is the cleanup code which is not exception-safe. So I think if we make that code not throw (by catching the exceptions), it should be sufficient.

Comment by Andy Schwerin [ 09/Jan/18 ]

How hard would it be to make the source manager exception safe?

Comment by Kaloian Manassiev [ 09/Jan/18 ]

Looks like it is not safe for code to throw in any of the MigrationSourceManager's cleanup methods. I think it is safe to ignore the stepdown errors, because we won't clear the migration flag as well, but we just need to make it not crash in that situation.

Generated at Thu Feb 08 04:30:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.