[SERVER-55357] Repeated retryable write execution on tenant migration recipient Created: 19/Mar/21  Updated: 29/Oct/23  Resolved: 05/Apr/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0-rc1, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: pm-1791_non-cloud-blocking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-52713 [testing] Add stepdown/kill/terminate... Closed
is depended on by SERVER-55236 Test retrying session migration after... Closed
Related
is related to SERVER-57098 Recipient cannot uassert that we don'... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.9
Sprint: Repl 2021-04-05, Repl 2021-04-19
Participants:

 Description   

Based on the EVG patch builds for SERVER-52713, the recipient can end up processing a retryable write more than once and hit this uassert

buildscripts.resmokelib.errors.ServerFailure: Tenant migration with donor replica set 'rs1' aborted due to an error: {'state': 'aborted', 'abortReason': {'code': 5350902, 'codeName': 'Location5350902', 'errmsg': 'Tenant migration recipient command failed :: caused by :: Tenant oplog application processed same retryable write twice for transaction 0 statement 0 on session 49660e6b-331b-47a2-aa17-efe2c169c8a5 - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU='}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1616103291, 4), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1616103291, 4)}

or fail this fassert.

[j0:rs0:n1] | 2021-03-19T02:54:54.971+00:00 F  STORAGE  40526   [conn114] "Statement from transaction was committed twice. This indicates possible data corruption or server bug and the process will be terminated","attr":{"stmtId":0,"lsid":{"id":{"$uuid":"27ab03e4-f71f-4846-b6c3-7b2d506cb331"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":49,"firstCommitOpTime":{"ts":{"$timestamp":{"t":1616122473,"i":65}},"t":2},"secondCommitOpTime":{"ts":{"$timestamp":{"t":1616122455,"i":11}},"t":2}}
[j0:rs0:n1] | 2021-03-19T02:54:55.232+00:00 I  CONTROL  31380   [conn114] "BACKTRACE","attr":{"bt":{"backtrace":[{"a":"7FA544A6A112","b":"7FA544878000","o":"1F2112","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.630","s+":"212"},{"a":"7FA544A6BB39","b":"7FA544878000","o":"1F3B39","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7FA544A6707F","b":"7FA544878000","o":"1EF07F","s":"abruptQuit","s+":"6F"},{"a":"7FA542E86D80","b":"7FA542E74000","o":"12D80","s":"funlockfile","s+":"50"},{"a":"7FA54276593F","b":"7FA54272E000","o":"3793F","s":"gsignal","s+":"10F"},{"a":"7FA54274FC95","b":"7FA54272E000","o":"21C95","s":"abort","s+":"127"},{"a":"7FA544996916","b":"7FA544878000","o":"11E916","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"12B"},{"a":"7FA5405F5349","b":"7FA5405C3000","o":"32349","s":"_ZN5mongo12_GLOBAL__N_126fassertOnRepeatedExecutionERKNS_16LogicalSessionIdEliRKNS_4repl6OpTimeES7_.cold.2327","s+":"16"},{"a":"7FA5406172E2","b":"7FA5405C3000","o":"542E2","s":"_ZN5mongo22TransactionParticipant11Participant27_refreshFromStorageIfNeededEPNS_16OperationContextEb","s+":"962"},{"a":"7FA5405FE69F","b":"7FA5405C3000","o":"3B69F","s":"_ZN5mongo29MongoDOperationContextSessionC1EPNS_16OperationContextE","s+":"4F"},{"a":"7FA53BD91436","b":"7FA53BD54000","o":"3D436","s":"_ZN5mongo12_GLOBAL__N_131CheckoutSessionAndInvokeCommand3runEv","s+":"136"},{"a":"7FA53BD9267A","b":"7FA53BD54000","o":"3E67A","s":"_ZN5mongo12_GLOBAL__N_114RunCommandImpl11_runCommandEv","s+":"29A"},{"a":"7FA53BD92D83","b":"7FA53BD54000","o":"3ED83","s":"_ZN5mongo12_GLOBAL__N_132RunCommandAndWaitForWriteConcern24_runCommandWithFailPointEv","s+":"83"},{"a":"7FA53BD9326C","b":"7FA53BD54000","o":"3F26C","s":"_ZN5mongo12_GLOBAL__N_132RunCommandAndWaitForWriteConcern8_runImplEv","s+":"3C"},{"a":"7FA53BD860D9","b":"7FA53BD54000","o":"320D9","s":"_ZN5mongo12_GLOBAL__N_114RunCommandImpl3runEv","s+":"139"},{"a":"7FA53BD867AB","b":"7FA53BD54000","o":"327AB","s":"_ZN5mongo12_GLOBAL__N_119ExecCommandDatabase12_commandExecEv","s+":"1CB"},{"a":"7FA53BD8E7DB","b":"7FA53BD54000","o":"3A7DB","s":"_ZZN5mongo12_GLOBAL__N_114executeCommandESt10shared_ptrINS0_13HandleRequest16ExecutionContextEEENUlvE0_clEv","s+":"44B"},{"a":"7FA53BD908A0","b":"7FA53BD54000","o":"3C8A0","s":"_ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL__N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE0_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_","s+":"80"},{"a":"7FA544BD7ABF","b":"7FA544BA0000","o":"37ABF","s":"_ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","s+":"12F"},{"a":"7FA541EF9BA2","b":"7FA541E84000","o":"75BA2","s":"_ZN5mongo14future_details10FutureImplINS0_8FakeVoidEE11generalImplIZNOS3_17propagateResultToEPNS0_15SharedStateImplIS2_EEEUlOS2_E_ZNOS3_17propagateResultToES7_EUlONS_6StatusEE0_ZNOS3_17propagateResultToES7_EUlvE1_EEDaOT_OT0_OT1_","s+":"152"},{"a":"7FA53BD8B25C","b":"7FA53BD54000","o":"3725C","s":"_ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL__N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_","s+":"AC"},{"a":"7FA544BD7ABF","b":"7FA544BA0000","o":"37ABF","s":"_ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","s+":"12F"},{"a":"7FA53BD8F1A1","b":"7FA53BD54000","o":"3B1A1","s":"_ZN5mongo12_GLOBAL__N_114executeCommandESt10shared_ptrINS0_13HandleRequest16ExecutionContextEE","s+":"211"},{"a":"7FA53BD8FB0B","b":"7FA53BD54000","o":"3BB0B","s":"_ZN5mongo12_GLOBAL__N_116receivedCommandsESt10shared_ptrINS0_13HandleRequest16ExecutionContextEE","s+":"1BB"},{"a":"7FA53BD907E6","b":"7FA53BD54000","o":"3C7E6","s":"_ZN5mongo12_GLOBAL__N_115CommandOpRunner3runEv","s+":"36"},{"a":"7FA53BD87888","b":"7FA53BD54000","o":"33888","s":"_ZZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageESt10unique_ptrIKNS0_5HooksESt14default_deleteIS8_EEENKUlvE_clEv","s+":"A8"},{"a":"7FA53BD87DD4","b":"7FA53BD54000","o":"33DD4","s":"_ZZNO5mongo14future_details10FutureImplINS0_8FakeVoidEE4thenIZNS_23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageESt10unique_ptrIKNS5_5HooksESt14default_deleteISD_EEEUlvE_EEDaOT_ENKUlOS2_E_clESK_.isra.2166","s+":"24"},{"a":"7FA53BD89D09","b":"7FA53BD54000","o":"35D09","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageESt10unique_ptrIKNS0_5HooksESt14default_deleteIS8_EE","s+":"2D9"},{"a":"7FA541420F83","b":"7FA541417000","o":"9F83","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE","s+":"53"},{"a":"7FA541400C3D","b":"7FA5413CE000","o":"32C3D","s":"_ZN5mongo9transport19ServiceStateMachine4Impl14processMessageEv","s+":"12D"},{"a":"7FA541401027","b":"7FA5413CE000","o":"33027","s":"_ZZNO5mongo14future_details10FutureImplINS0_8FakeVoidEE4thenIZNS_9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusEEUlvE0_EEDaOT_ENKUlOS2_E_clESE_.isra.724","s+":"27"},{"a":"7FA5414037D0","b":"7FA5413CE000","o":"357D0","s":"_ZN5mongo9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusE","s+":"240"},{"a":"7FA541403CAF","b":"7FA5413CE000","o":"35CAF","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZZNS_9transport19ServiceStateMachine4Impl12startNewLoopERKS1_ENKUlS1_E2_clES1_EUlS1_E0_EEDaOT_EN12SpecificImpl4callEOS1_","s+":"7F"},{"a":"7FA53DEA7B01","b":"7FA53DE75000","o":"32B01","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_9transport26ServiceExecutorSynchronous18runOnDataAvailableERKSt10shared_ptrINS5_7SessionEES3_EUlS1_E_EEDaOT_EN12SpecificImpl4callEOS1_","s+":"41"},{"a":"7FA53DE9E143","b":"7FA53DE75000","o":"29143","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport15ServiceExecutor8scheduleENS0_IFvNS_6StatusEEEEEUlvE_EEDaOT_EN12SpecificImpl4callEv","s+":"33"},{"a":"7FA53DEAA7E0","b":"7FA53DE75000","o":"357E0","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport26ServiceExecutorSynchronous12scheduleTaskES2_NS4_15ServiceExecutor13ScheduleFlagsEEUlvE0_EEDaOT_EN12SpecificImpl4callEv","s+":"B0"},{"a":"7FA53DEABE3C","b":"7FA53DE75000","o":"36E3C","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_25launchServiceWorkerThreadES2_EUlvE2_EEDaOT_EN12SpecificImpl4callEv","s+":"5C"},{"a":"7FA53DEABEAC","b":"7FA53DE75000","o":"36EAC","s":"_ZN5mongo12_GLOBAL__N_17runFuncEPv","s+":"1C"},{"a":"7FA542E7C2DE","b":"7FA542E74000","o":"82DE","s":"start_thread","s+":"FE"},{"a":"7FA54282AA63","b":"7FA54272E000","o":"FCA63","s":"clone","s+":"43"}],"processInfo":{"mongodbVersion":"4.9.0-alpha7-169-g612a372-patch-6054068b5623432db686814d","gitVersion":"612a3725d98381bf9c0777bcd6b2169cae33f4d1","compiledModules":["enterprise"],"uname":{"sysname":"Linux","release":"4.18.0-80.1.2.el8_0.x86_64","version":"#1 SMP Sun Apr 28 09:21:22 UTC 2019","machine":"x86_64"},"somap":[{"b":"7FA544878000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/libbase.so","elfType":3,"buildId":"C4E3038F4C4927C9EFB6C61A7C09F5DCB339A1AE"},{"b":"7FA544BA0000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/libmongod_main.so","elfType":3,"buildId":"2267A5EDC9634E0DE3E9AF323BC4D790F7610D30"},{"b":"7FA542E74000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"5326B8728FA01B7149DAC943100F1405533E76CE"},{"b":"7FA54272E000","path":"/lib64/libc.so.6","elfType":3,"buildId":"0598B7D6A05E64AE676133CF6331AF5578888AD0"},{"b":"7FA541E84000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/libtransport_layer.so","elfType":3,"buildId":"3FBB6683B5CB3C7C87B286DE51C739F8622D1ABB"},{"b":"7FA541417000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/libservice_context_d.so","elfType":3,"buildId":"3C84D0105E8DB818D385378355129606BAAB347A"},{"b":"7FA5413CE000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/libservice_entry_point.so","elfType":3,"buildId":"BA30636559FF9AC794DA6DE2EA19C820AF53A103"},{"b":"7FA5405C3000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/libtransaction.so","elfType":3,"buildId":"510EBD03635CF6D728A712365D605AB024CCB98C"},{"b":"7FA53DE75000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/libservice_executor.so","elfType":3,"buildId":"D3C4A9977BA84311C3E699E518A82EAE33D6A3DC"},{"b":"7FA53BD54000","path":"/data/mci/c31bb7a5c5cbba085cb15bea09e68574/src/dist-test/bin/../lib/../lib/../lib/libservice_entry_point_common.so","elfType":3,"buildId":"7137FAAB4DA4B3D1E2FD98963EC4BE27A703525F"}]}}}

 



 Comments   
Comment by Githook User [ 05/Apr/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-55357: Start a new retryable write chain on each migration

(cherry picked from commit 03fbe539140630c6da4a29dfee160649d6453457)
Branch: v4.9
https://github.com/mongodb/mongo/commit/a5e7e1bf475025df6faf17709ba9e34b83c26e69

Comment by Githook User [ 05/Apr/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-55357: Start a new retryable write chain on each migration
Branch: master
https://github.com/mongodb/mongo/commit/03fbe539140630c6da4a29dfee160649d6453457

Comment by Lingzhi Deng [ 31/Mar/21 ]

I think the problem is that we always use the lastWriteOpTime from the TransactionParticipant as the prevWriteOpTime for a retryable write noop entry. But if the recipient has migrated the same retryable write before (in a previously aborted migration), the recipient could end up linking the new retryable write noops to the existing retryable write chain for the same session.
For example, for retryable write with session 1, txn 1:
1. After a failed migration, recipient could have lastWriteOpTime 10 for session 1, txn 1 in config.transactions.
2. Retrying the same migration, recipient would migrate the same retryable write and update lastWriteOpTime 20 for session 1, txn 1 in config.transactions. And the recipient would also link the newly generated retryable noop at ts20 to the retryable noop at ts10 from the last migration.
3. The recipient's retryable chain ends up having two entries for the same stmtId.

I was able to reproduce the same issue in SERVER-55236.

Generated at Thu Feb 08 05:36:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.