-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Sharding
-
Fully Compatible
-
v4.4
-
Sharding 2020-03-23
-
39
On step up, a node runs resumeMigrationCoordinationsOnStepUp which reads docs in the migrationCoordinators collection in order to recover any migrations. The following race can happen if there is a moveChunk where the source is the node stepping up run quickly after the stepdown/stepup command:
1. node 1 in shard A is stepping up
2. move chunk from shard A to shard B starts
3. node 1 shard A writes the migration coordinator doc from the move chunk in step 2
4. node 1 shard A is running resumeMigrationCoordinationsOnStepUp and finds the doc written in #3 and starts that moveChunk again
5. node 1 aborts the migration
This isn't a correctness bug, but causes migrations to abort unnecessarily. Below are log lines showing this scenario:
[js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.402+0000 d20023| 2020-03-05T23:47:06.402+0000 D2 MIGRATION 23889 [MoveChunk] "Starting chunk migration {args} with expected collection version epoch {args_getVersionEpoch}","attr":{"args":"ns: test.findAndMod-update-postImage, [{ x: MinKey }, { x: MaxKey }), fromShard: move_chunk_find_and_modify_with_write_retryability-rs1, toShard: move_chunk_find_and_modify_with_write_retryability-rs0","args_getVersionEpoch":{"$oid":"5e618f62bcc83a106428cd17"}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.402+0000 d20023| 2020-03-05T23:47:06.402+0000 D2 MIGRATION 23889 [MoveChunk] "Persisting migration coordinator doc","attr":{"migrationDoc":{"_id":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"},"migrationSessionId":"move_chunk_find_and_modify_with_write_retryability-rs1_move_chunk_find_and_modify_with_write_retryability-rs0_5e618f7a444eb86b3f2befca","lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":2,"nss":"test.findAndMod-update-postImage","collectionUuid":{"$uuid":"e46a8870-d22b-43ed-b91f-142265d8be10"},"donorShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","recipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","range":{"min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}}},"preMigrationChunkVersion":{"0":{"$timestamp":{"t":2,"i":0}},"1":{"$oid":"5e618f62bcc83a106428cd17"}}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.792+0000 d20023| 2020-03-05T23:47:06.792+0000 I TXN 22452 [TransactionCoordinator] "Need to resume coordinating commit for {coordinatorDocs_size} transactions","attr":{"coordinatorDocs_size":0} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.792+0000 d20023| 2020-03-05T23:47:06.792+0000 D2 MIGRATION 22039 [MigrationCoordinatorStepupRecovery] "Recovering migration {doc}","attr":{"doc":{"_id":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"},"migrationSessionId":"move_chunk_find_and_modify_with_write_retryability-rs1_move_chunk_find_and_modify_with_write_retryability-rs0_5e618f7a444eb86b3f2befca","lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":2,"nss":"test.findAndMod-update-postImage","collectionUuid":{"$uuid":"e46a8870-d22b-43ed-b91f-142265d8be10"},"donorShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","recipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","range":{"min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}}},"preMigrationChunkVersion":{"0":{"$timestamp":{"t":2,"i":0}},"1":{"$oid":"5e618f62bcc83a106428cd17"}}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.809+0000 d20023| | 2020-03-05T23:47:06.809+0000 I SH_REFR 24104 [ConfigServerCatalogCacheLoader-1] "Refresh for collection {nss}{existingRoutingInfo_from_version_existingRoutingInfo_getVersion} to version {routingInfoAfterRefresh_getVersion} took {t_millis} ms","attr":{"nss":"test.findAndMod-update-postImage","existingRoutingInfo_from_version_existingRoutingInfo_getVersion":" from version 2|0||5e618f62bcc83a106428cd17","routingInfoAfterRefresh_getVersion":"3|0||5e618f62bcc83a106428cd17","t_millis":7} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.814+0000 d20023| | 2020-03-05T23:47:06.814+0000 I SH_REFR 24104 [ConfigServerCatalogCacheLoader-1] "Refresh for collection {nss}{existingRoutingInfo_from_version_existingRoutingInfo_getVersion} to version {routingInfoAfterRefresh_getVersion} took {t_millis} ms","attr":{"nss":"test.findAndMod-update-postImage","existingRoutingInfo_from_version_existingRoutingInfo_getVersion":" from version 2|0||5e618f62bcc83a106428cd17","routingInfoAfterRefresh_getVersion":"3|0||5e618f62bcc83a106428cd17","t_millis":5} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.814+0000 d20023| | 2020-03-05T23:47:06.814+0000 I SHARDING 21985 [MigrationCoordinatorStepupRecovery] "Updating metadata for collection {nss_ns} from {activeMetadata_Basic} to {remoteMetadata_Basic} due to version change","attr":{"nss_ns":"test.findAndMod-update-postImage","activeMetadata_Basic":"collection version: 2|0||5e618f62bcc83a106428cd17, shard version: 2|0||5e618f62bcc83a106428cd17","remoteMetadata_Basic":"collection version: 3|0||5e618f62bcc83a106428cd17, shard version: 3|0||5e618f62bcc83a106428cd17"} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.814+0000 D2 MIGRATION 23891 [MigrationCoordinatorStepupRecovery] "MigrationCoordinator setting migration decision to {decision_Decision_kCommitted_committed_aborted}","attr":{"decision_Decision_kCommitted_committed_aborted":"aborted","migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.815+0000 I MIGRATION 23893 [MigrationCoordinatorStepupRecovery] "MigrationCoordinator delivering decision {decision_Decision_kCommitted_committed_aborted} to self and to recipient","attr":{"decision_Decision_kCommitted_committed_aborted":"aborted","migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.815+0000 D2 MIGRATION 23899 [MigrationCoordinatorStepupRecovery] "Making abort decision durable","attr":{"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.828+0000 d20023| 2020-03-05T23:47:06.828+0000 D2 MIGRATION 23900 [MigrationCoordinatorStepupRecovery] "Bumping transaction for {migrationInfo_getRecipientShardId} lsid: {migrationInfo_getLsid} txn: {TxnNumber}","attr":{"migrationInfo_getRecipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","migrationInfo_getLsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"TxnNumber":2,"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.828+0000 d20023| 2020-03-05T23:47:06.828+0000 D1 MIGRATION 22023 [MigrationCoordinatorStepupRecovery] "Sending request {cmd} to recipient.","attr":{"cmd":{"update":"system.version","bypassDocumentValidation":false,"ordered":true,"updates":[{"q":{"_id":"migrationCoordinatorStats"},"u":{"$inc":{"count":1}},"multi":false,"upsert":true}],"writeConcern":{"w":"majority"},"lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":3}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.829+0000 d20020| 2020-03-05T23:47:06.828+0000 D4 TXN 23984 [conn55] "New transaction started with txnNumber: {txnNumber} on session with lsid {sessionId_getId}","attr":{"txnNumber":3,"sessionId_getId":{"uuid":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.861+0000 d20020| 2020-03-05T23:47:06.861+0000 I MIGRATION 21998 [migrateThread] "{msg}","attr":{"msg":"migrate failed: TransactionTooOld: Retryable write with txnNumber 2 is prohibited on session bea47927-e355-43ad-8cf1-4f0ac2f720a3 - u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A= because a newer retryable write with txnNumber 3 has already started on this session."} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.862+0000 d20023| 2020-03-05T23:47:06.862+0000 I SHARDING 21993 [MoveChunk] "moveChunk data transfer progress: {res} mem used: {memoryUsed} documents remaining to clone: {cloneLocsRemaining}","attr":{"res":{"waited":true,"active":false,"ns":"test.findAndMod-update-postImage","from":"move_chunk_find_and_modify_with_write_retryability-rs1/ip-10-122-8-3:20022,ip-10-122-8-3:20023","fromShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}},"shardKeyPattern":{"x":1},"state":"fail","errmsg":"migrate failed: TransactionTooOld: Retryable write with txnNumber 2 is prohibited on session bea47927-e355-43ad-8cf1-4f0ac2f720a3 - u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A= because a newer retryable write with txnNumber 3 has already started on this session.","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$gleStats":{"lastOpTime":{"ts":{"$timestamp":{"t":1583452026,"i":18}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"}},"lastCommittedOpTime":{"$timestamp":{"t":1583452026,"i":15}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1583452026,"i":8}},"t":1}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1583452026,"i":20}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1583452026,"i":20}}},"memoryUsed":0,"cloneLocsRemaining":1} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.873+0000 d20023| 2020-03-05T23:47:06.872+0000 D2 MIGRATION 23900 [MoveChunk] "Bumping transaction for {migrationInfo_getRecipientShardId} lsid: {migrationInfo_getLsid} txn: {TxnNumber}","attr":{"migrationInfo_getRecipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","migrationInfo_getLsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"TxnNumber":2,"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}} [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.925+0000