Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-46756

Migration recovery on stepup can attempt to "recover" a move chunk already running and cause it to abort

    • Fully Compatible
    • v4.4
    • Sharding 2020-03-23
    • 39

      On step up, a node runs resumeMigrationCoordinationsOnStepUp which reads docs in the migrationCoordinators collection in order to recover any migrations. The following race can happen if there is a moveChunk where the source is the node stepping up run quickly after the stepdown/stepup command:

      1. node 1 in shard A is stepping up
      2. move chunk from shard A to shard B starts
      3. node 1 shard A writes the migration coordinator doc from the move chunk in step 2
      4. node 1 shard A is running resumeMigrationCoordinationsOnStepUp and finds the doc written in #3 and starts that moveChunk again
      5. node 1 aborts the migration

      This isn't a correctness bug, but causes migrations to abort unnecessarily. Below are log lines showing this scenario:

      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.402+0000 d20023| 2020-03-05T23:47:06.402+0000 D2 MIGRATION 23889 [MoveChunk] "Starting chunk migration {args} with expected collection version epoch {args_getVersionEpoch}","attr":{"args":"ns: test.findAndMod-update-postImage, [{ x: MinKey }, { x: MaxKey }), fromShard: move_chunk_find_and_modify_with_write_retryability-rs1, toShard: move_chunk_find_and_modify_with_write_retryability-rs0","args_getVersionEpoch":{"$oid":"5e618f62bcc83a106428cd17"}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.402+0000 d20023| 2020-03-05T23:47:06.402+0000 D2 MIGRATION 23889   [MoveChunk] "Persisting migration coordinator doc","attr":{"migrationDoc":{"_id":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"},"migrationSessionId":"move_chunk_find_and_modify_with_write_retryability-rs1_move_chunk_find_and_modify_with_write_retryability-rs0_5e618f7a444eb86b3f2befca","lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":2,"nss":"test.findAndMod-update-postImage","collectionUuid":{"$uuid":"e46a8870-d22b-43ed-b91f-142265d8be10"},"donorShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","recipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","range":{"min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}}},"preMigrationChunkVersion":{"0":{"$timestamp":{"t":2,"i":0}},"1":{"$oid":"5e618f62bcc83a106428cd17"}}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.792+0000 d20023| 2020-03-05T23:47:06.792+0000 I  TXN      22452   [TransactionCoordinator] "Need to resume coordinating commit for {coordinatorDocs_size} transactions","attr":{"coordinatorDocs_size":0}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.792+0000 d20023| 2020-03-05T23:47:06.792+0000 D2 MIGRATION 22039   [MigrationCoordinatorStepupRecovery] "Recovering migration {doc}","attr":{"doc":{"_id":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"},"migrationSessionId":"move_chunk_find_and_modify_with_write_retryability-rs1_move_chunk_find_and_modify_with_write_retryability-rs0_5e618f7a444eb86b3f2befca","lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":2,"nss":"test.findAndMod-update-postImage","collectionUuid":{"$uuid":"e46a8870-d22b-43ed-b91f-142265d8be10"},"donorShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","recipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","range":{"min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}}},"preMigrationChunkVersion":{"0":{"$timestamp":{"t":2,"i":0}},"1":{"$oid":"5e618f62bcc83a106428cd17"}}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.809+0000 d20023| | 2020-03-05T23:47:06.809+0000 I  SH_REFR  24104   [ConfigServerCatalogCacheLoader-1] "Refresh for collection {nss}{existingRoutingInfo_from_version_existingRoutingInfo_getVersion} to version {routingInfoAfterRefresh_getVersion} took {t_millis} ms","attr":{"nss":"test.findAndMod-update-postImage","existingRoutingInfo_from_version_existingRoutingInfo_getVersion":" from version 2|0||5e618f62bcc83a106428cd17","routingInfoAfterRefresh_getVersion":"3|0||5e618f62bcc83a106428cd17","t_millis":7}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.814+0000 d20023| | 2020-03-05T23:47:06.814+0000 I  SH_REFR  24104   [ConfigServerCatalogCacheLoader-1] "Refresh for collection {nss}{existingRoutingInfo_from_version_existingRoutingInfo_getVersion} to version {routingInfoAfterRefresh_getVersion} took {t_millis} ms","attr":{"nss":"test.findAndMod-update-postImage","existingRoutingInfo_from_version_existingRoutingInfo_getVersion":" from version 2|0||5e618f62bcc83a106428cd17","routingInfoAfterRefresh_getVersion":"3|0||5e618f62bcc83a106428cd17","t_millis":5}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.814+0000 d20023| | 2020-03-05T23:47:06.814+0000 I  SHARDING 21985   [MigrationCoordinatorStepupRecovery] "Updating metadata for collection {nss_ns} from {activeMetadata_Basic} to {remoteMetadata_Basic} due to version change","attr":{"nss_ns":"test.findAndMod-update-postImage","activeMetadata_Basic":"collection version: 2|0||5e618f62bcc83a106428cd17, shard version: 2|0||5e618f62bcc83a106428cd17","remoteMetadata_Basic":"collection version: 3|0||5e618f62bcc83a106428cd17, shard version: 3|0||5e618f62bcc83a106428cd17"}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.814+0000 D2 MIGRATION 23891   [MigrationCoordinatorStepupRecovery] "MigrationCoordinator setting migration decision to {decision_Decision_kCommitted_committed_aborted}","attr":{"decision_Decision_kCommitted_committed_aborted":"aborted","migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.815+0000 I  MIGRATION 23893   [MigrationCoordinatorStepupRecovery] "MigrationCoordinator delivering decision {decision_Decision_kCommitted_committed_aborted} to self and to recipient","attr":{"decision_Decision_kCommitted_committed_aborted":"aborted","migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.815+0000 d20023| 2020-03-05T23:47:06.815+0000 D2 MIGRATION 23899   [MigrationCoordinatorStepupRecovery] "Making abort decision durable","attr":{"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.828+0000 d20023| 2020-03-05T23:47:06.828+0000 D2 MIGRATION 23900   [MigrationCoordinatorStepupRecovery] "Bumping transaction for {migrationInfo_getRecipientShardId} lsid: {migrationInfo_getLsid} txn: {TxnNumber}","attr":{"migrationInfo_getRecipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","migrationInfo_getLsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"TxnNumber":2,"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.828+0000 d20023| 2020-03-05T23:47:06.828+0000 D1 MIGRATION 22023   [MigrationCoordinatorStepupRecovery] "Sending request {cmd} to recipient.","attr":{"cmd":{"update":"system.version","bypassDocumentValidation":false,"ordered":true,"updates":[{"q":{"_id":"migrationCoordinatorStats"},"u":{"$inc":{"count":1}},"multi":false,"upsert":true}],"writeConcern":{"w":"majority"},"lsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"txnNumber":3}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.829+0000 d20020| 2020-03-05T23:47:06.828+0000 D4 TXN      23984   [conn55] "New transaction started with txnNumber: {txnNumber} on session with lsid {sessionId_getId}","attr":{"txnNumber":3,"sessionId_getId":{"uuid":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.861+0000 d20020| 2020-03-05T23:47:06.861+0000 I  MIGRATION 21998   [migrateThread] "{msg}","attr":{"msg":"migrate failed: TransactionTooOld: Retryable write with txnNumber 2 is prohibited on session bea47927-e355-43ad-8cf1-4f0ac2f720a3 - u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A= because a newer retryable write with txnNumber 3 has already started on this session."}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.862+0000 d20023| 2020-03-05T23:47:06.862+0000 I  SHARDING 21993   [MoveChunk] "moveChunk data transfer progress: {res} mem used: {memoryUsed} documents remaining to clone: {cloneLocsRemaining}","attr":{"res":{"waited":true,"active":false,"ns":"test.findAndMod-update-postImage","from":"move_chunk_find_and_modify_with_write_retryability-rs1/ip-10-122-8-3:20022,ip-10-122-8-3:20023","fromShardId":"move_chunk_find_and_modify_with_write_retryability-rs1","min":{"x":{"$minKey":1}},"max":{"x":{"$maxKey":1}},"shardKeyPattern":{"x":1},"state":"fail","errmsg":"migrate failed: TransactionTooOld: Retryable write with txnNumber 2 is prohibited on session bea47927-e355-43ad-8cf1-4f0ac2f720a3 - u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A= because a newer retryable write with txnNumber 3 has already started on this session.","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1,"$gleStats":{"lastOpTime":{"ts":{"$timestamp":{"t":1583452026,"i":18}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"}},"lastCommittedOpTime":{"$timestamp":{"t":1583452026,"i":15}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1583452026,"i":8}},"t":1}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1583452026,"i":20}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1583452026,"i":20}}},"memoryUsed":0,"cloneLocsRemaining":1}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.873+0000 d20023| 2020-03-05T23:47:06.872+0000 D2 MIGRATION 23900   [MoveChunk] "Bumping transaction for {migrationInfo_getRecipientShardId} lsid: {migrationInfo_getLsid} txn: {TxnNumber}","attr":{"migrationInfo_getRecipientShardId":"move_chunk_find_and_modify_with_write_retryability-rs0","migrationInfo_getLsid":{"id":{"$uuid":"bea47927-e355-43ad-8cf1-4f0ac2f720a3"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"TxnNumber":2,"migrationId":{"uuid":{"$uuid":"d5de5090-339c-4ce9-b418-3a50afba6706"}}}
      [js_test:move_chunk_find_and_modify_with_write_retryability] 2020-03-05T23:47:06.925+0000 
      

            Assignee:
            esha.maharishi@mongodb.com Esha Maharishi (Inactive)
            Reporter:
            janna.golden@mongodb.com Janna Golden
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: