[SERVER-38190] killOp while committing a prepared transaction can trigger std::terminate
Created: 16/Nov/18  Updated: 29/Oct/23  Resolved: 28/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.8

Type: Bug
Priority: Major - P3
Reporter: Jack Mulrow
Assignee: Randolph Tan
Resolution: Fixed
Votes: 0
Labels: prepare_errors, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-37991 Release ReplicationStateTransitionLoc... Closed
Related
related to SERVER-38299 killOp while preparing a transaction ... Closed
related to SERVER-42987 If an abortTransaction command gets i... Closed
related to SERVER-39949 Replace uses of UninterruptibleLockGu... Closed
is related to SERVER-36485 ‘killSessions’ (for one session) and ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-12-03, Sharding 2018-12-17, Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28, Sharding 2019-02-11
Participants:

 Description   

If a killOp (issued directly or triggered by killing a session) interrupts a thread that is committing a prepared transaction, an interrupted exception can be thrown inside the try block in TransactionParticipant::commitPreparedTransaction. That exception is caught there and triggers std::terminate.
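The failure pattern described above can be sketched in a few lines of C++. This is a toy model, not the server code: `ToyOpCtx`, `toyCommitPrepared`, and the `onFatal` callback are hypothetical names introduced here, and `onFatal` stands in for std::terminate so the behaviour can be observed without aborting the process.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an interruptible operation context.
struct ToyOpCtx {
    bool killed = false;  // set by a simulated killOp

    // Throws if the operation has been killed, like an interrupt check.
    void checkForInterrupt() const {
        if (killed) {
            throw std::runtime_error("Interrupted: operation was interrupted");
        }
    }
};

// Sketch of the failing pattern: any exception that escapes the commit
// steps is treated as fatal. `onFatal` is an assumption of this sketch and
// plays the role that std::terminate plays in the report.
inline std::string toyCommitPrepared(
    const ToyOpCtx& opCtx,
    const std::function<void(const std::string&)>& onFatal) {
    assert(onFatal && "a fatal handler is required");
    try {
        // Stands in for an interruptible step inside the commit path.
        opCtx.checkForInterrupt();
        return "committed";
    } catch (const std::exception& ex) {
        // The interrupt is indistinguishable from any other commit failure.
        onFatal(ex.what());
        return "fatal";
    }
}
```

In this model a killed context always reaches the fatal handler, which mirrors the "Interrupted: operation was interrupted" message that precedes the backtrace below.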

Example failure and backtrace (evergreen logs w/ bookmarks):

[ShardedClusterFixture:job4:shard0:primary] 2018-11-16T21:36:58.109+0000 F -        [conn79] DBException::toString(): Interrupted: operation was interrupted
[ShardedClusterFixture:job4:shard0:primary] Actual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)11601, mongo::ExceptionForCat<(mongo::ErrorCategory)1> >
[ShardedClusterFixture:job4:shard0:primary]  0x7fdf369e44d1 0x7fdf369e3eb5 0x7fdf36ad96d6 0x7fdf36ad9721 0x7fdf35f01d74 0x7fdf354ee324 0x7fdf3633c7a4 0x7fdf35079d20 0x7fdf3507a3a6 0x7fdf3507c8f1 0x7fdf3507e08c 0x7fdf3507f076 0x7fdf3506a07a 0x7fdf35077763 0x7fdf35072007 0x7fdf350759f1 0x7fdf3618d492 0x7fdf350700e0 0x7fdf35073425 0x7fdf3507163f 0x7fdf3507208d 0x7fdf350759f1 0x7fdf3618da05 0x7fdf36939c94 0x7fdf3185aaa1 0x7fdf315a7bdd
[ShardedClusterFixture:job4:shard0:primary] ----- BEGIN BACKTRACE -----
[ShardedClusterFixture:job4:shard0:primary] {"backtrace":[{"b":"7FDF344CC000","o":"25184D1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7FDF344CC000","o":"2517EB5"},{"b":"7FDF344CC000","o":"260D6D6","s":"_ZN10__cxxabiv111__terminateEPFvvE"},{"b":"7FDF344CC000","o":"260D721"},{"b":"7FDF344CC000","o":"1A35D74","s":"_ZN5mongo22TransactionParticipant25commitPreparedTransactionEPNS_16OperationContextENS_9TimestampE"},{"b":"7FDF344CC000","o":"1022324"},{"b":"7FDF344CC000","o":"1E707A4","s":"_ZN5mongo12BasicCommand10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE"},{"b":"7FDF344CC000","o":"BADD20"},{"b":"7FDF344CC000","o":"BAE3A6"},{"b":"7FDF344CC000","o":"BB08F1"},{"b":"7FDF344CC000","o":"BB208C"},{"b":"7FDF344CC000","o":"BB3076","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE"},{"b":"7FDF344CC000","o":"B9E07A","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"7FDF344CC000","o":"BAB763","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"7FDF344CC000","o":"BA6007","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"7FDF344CC000","o":"BA99F1"},{"b":"7FDF344CC000","o":"1CC1492","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"7FDF344CC000","o":"BA40E0","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"7FDF344CC000","o":"BA7425","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"7FDF344CC000","o":"BA563F","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"7FDF344CC000","o":"BA608D","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"7FDF344CC000","o":"BA99F1"},{"b":"7FDF344CC000","o":
"1CC1A05"},{"b":"7FDF344CC000","o":"246DC94"},{"b":"7FDF31853000","o":"7AA1"},{"b":"7FDF314BF000","o":"E8BDD","s":"clone"}],"processInfo":{ "mongodbVersion" : "4.1.5-137-g910a0bda4c-patch-5bef34452fbabe6f7cedf3c0", "gitVersion" : "910a0bda4c434571600e67d78afb2424dc94eaaa", "compiledModules" : [ "enterprise" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-220.el6.x86_64", "version" : "#1 SMP Wed Nov 9 08:03:13 EST 2011", "machine" : "x86_64" }, "somap" : [ { "b" : "7FDF344CC000", "elfType" : 3, "buildId" : "9DF7BE2657722D848ABF7D6F15D48FA26E0B9136" }, { "b" : "7FFF0D0FF000", "elfType" : 3, "buildId" : "08F634A1D22DEFF00461D50A7699DACDC97657BF" }, { "b" : "7FDF3405D000", "path" : "/usr/lib64/libnetsnmpagent.so.20", "elfType" : 3, "buildId" : "1270BB069D761BD79C79F8986BB3ED5DCAA7D06D" }, { "b" : "7FDF33E37000", "path" : "/usr/lib64/libnetsnmphelpers.so.20", "elfType" : 3, "buildId" : "3FA4F246A7DF00EC1355C5226C9308DC7B4AB5CD" }, { "b" : "7FDF3396F000", "path" : "/usr/lib64/libnetsnmpmibs.so.20", "elfType" : 3, "buildId" : "AE65092368DDB948A32B62D613DD8FFE210EBEB9" }, { "b" : "7FDF33694000", "path" : "/usr/lib64/libnetsnmp.so.20", "elfType" : 3, "buildId" : "52E4D411A95E6C7FCCE0E1942B525AC8FBBDF4A8" }, { "b" : "7FA24C643000", "path" : "/lib64/libldap-2.4.so.2", "elfType" : 3, "buildId" : "DDBAC283102A61D6A63B3F3952A1C06657FF3AE8" }, { "b" : "7FA24C034000", "path" : "/lib64/liblber-2.4.so.2", "elfType" : 3, "buildId" : "244D2593BDE4FE657BC88572DB5DA88FA274B7F3" }, { "b" : "7FA24D61A000", "path" : "/usr/lib64/libsasl2.so.2", "elfType" : 3, "buildId" : "E0AEE889D5BF1373F2F9EE0D448DBF3F5B5113F0" }, { "b" : "7FA24DBD6000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "0C249DF4D77989253CCD859956BF50749308A16A" }, { "b" : "7FDF32B81000", "path" : "/usr/lib64/libcurl.so.4", "elfType" : 3, "buildId" : "A38B9CE8AEAF277CBD8BC1298B1731E2C9A66192" }, { "b" : "7FA251367000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : 
"F0BE1166EDCFFB2422B940D601A1BBD89352D80F" }, { "b" : "7FDF32582000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "1EDB45C205A844A75EBBB4F0075E705803FFB85B" }, { "b" : "7FDF32316000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "D256E285C5E11D9A99EB04CA7651003A8F67B64E" }, { "b" : "7FA252F12000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "1F7E85410384392BC51FA7324961719A10125F31" }, { "b" : "7FA25210A000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "FDF3A36FFFE08375456D59DA959EAB2FC30B6186" }, { "b" : "7FA251A86000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "8A852AC42F0B64F0F30C760EBBCFA3FE4A228F12" }, { "b" : "7FA250870000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "EDC925E58FE28DCA536993EB13179C739F1E6566" }, { "b" : "7FA251E53000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "85104ECFE42C606B31C2D0D0D2E5DACD3286A341" }, { "b" : "7FA251EBF000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8E3AACE76351B6A83390CA065E904EB82FBD1EC7" }, { "b" : "7FA2554A9000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "1CC2165E019D43F71FDE0A47AF9F4C8EB5E51963" }, { "b" : "7FDF312B4000", "path" : "/lib64/libwrap.so.0", "elfType" : 3, "buildId" : "083332F88CF3C61AB0184D8F397FC8BFF4548D8E" }, { "b" : "7FA250949000", "path" : "/usr/lib64/perl5/CORE/libperl.so", "elfType" : 3, "buildId" : "53842C2896DED0063E1BE5C650CE97C67AE97973" }, { "b" : "7FA250F30000", "path" : "/lib64/libnsl.so.1", "elfType" : 3, "buildId" : "D233CCCC987214EE5DACCF88949E31469228F6FF" }, { "b" : "7FA24ECF9000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "F542C8ACD4AD1F2C6A551043BDFBAB051905DA1C" }, { "b" : "7FA24DEF6000", "path" : "/lib64/libutil.so.1", "elfType" : 3, "buildId" : "2963FF1BBF4BF9131097982EB8BE5C905A342CBD" }, { "b" : "7FA24B48A000", "path" : "/usr/lib64/librpm.so.1", "elfType" : 3, "buildId" : 
"EF9F29119A5A42A613A040DE49BE7B1B46256B21" }, { "b" : "7FA24BA5B000", "path" : "/usr/lib64/librpmio.so.1", "elfType" : 3, "buildId" : "DA04B3F461614CD7524D92152540F6B8303C9F45" }, { "b" : "7FA24EC52000", "path" : "/lib64/libpopt.so.0", "elfType" : 3, "buildId" : "E7B49911F1136073DD7DC58E8118CD9A4FBE2A19" }, { "b" : "7FA24FA3C000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D053BB4FF0C2FC983842F81598813B9B931AD0D1" }, { "b" : "7FDF2FE2C000", "path" : "/usr/lib64/libsensors.so.4", "elfType" : 3, "buildId" : "6855E5BF5B3634C15F01B1043BD892D727EE3C08" }, { "b" : "7FA24C1D9000", "path" : "/usr/lib64/libssl3.so", "elfType" : 3, "buildId" : "BEF0080919EBF8ADB8C668B38BE1B19FD1F584BC" }, { "b" : "7FA24BBAC000", "path" : "/usr/lib64/libsmime3.so", "elfType" : 3, "buildId" : "2E107F018AF1E0916BE98EB24BFFE1C28CACB7F7" }, { "b" : "7FA24C064000", "path" : "/usr/lib64/libnss3.so", "elfType" : 3, "buildId" : "788E6C43CEF0FE8A2EFEF8FFD7B9B90B28EF44ED" }, { "b" : "7FA24C635000", "path" : "/usr/lib64/libnssutil3.so", "elfType" : 3, "buildId" : "3F385A7A46BD81EF0ED4CBCBEADC4D33AE6247E4" }, { "b" : "7FA24C031000", "path" : "/lib64/libplds4.so", "elfType" : 3, "buildId" : "04D1AC5F1C6C1B1AD5962DCA1225236B0B4953CE" }, { "b" : "7FA24C62C000", "path" : "/lib64/libplc4.so", "elfType" : 3, "buildId" : "C0559AF61C8808D4FC8B97D07EFFA459BFD93003" }, { "b" : "7FA24C7ED000", "path" : "/lib64/libnspr4.so", "elfType" : 3, "buildId" : "828BFCA03E208DFB48C4B874D81140EFC51D33C2" }, { "b" : "7FA24A106000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "624C7056B8BBE6BA758DEF557F516FBDBD01E1FD" }, { "b" : "7FA24A2DA000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "C81673692EEF670BC951EE726490F5D1CAB822F4" }, { "b" : "7FA24D4D6000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "57F77704A7F1F4E3689D028D3F9ADD4E77486EC9" }, { "b" : "7FA2496CB000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : 
"03B69EEB8998AC9CA7519A27571BAD976BA4C56D" }, { "b" : "7FA24A0C8000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "3BCCABE75DC61BBA81AAE45D164E26EF4F9F55DB" }, { "b" : "7FA247E96000", "path" : "/lib64/libidn.so.11", "elfType" : 3, "buildId" : "5659EB985475B586E3BBCB95BA21F4A30BE5EBF4" }, { "b" : "7FDF2DE6E000", "path" : "/usr/lib64/libssh2.so.1", "elfType" : 3, "buildId" : "8727EC925D6D91DAC74A99BDE8B3C6EE96AF13EA" }, { "b" : "7FA24BA6B000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "AFF1C795A3CF422C9F8AC32C7522F6376B1EA087" }, { "b" : "7FA24A05A000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "1250B1D041DD7552F0C870BB188DC3A34DF2651D" }, { "b" : "7FA24BA44000", "path" : "/usr/lib64/libelf.so.1", "elfType" : 3, "buildId" : "50517407A07B8D6C9A55A392E99246B52E8BFEEA" }, { "b" : "7FA24B423000", "path" : "/usr/lib64/liblzma.so.0", "elfType" : 3, "buildId" : "6FF9BAEEEE9DDEEF2DFA5CBD36147A75891C0AD4" }, { "b" : "7FA248DF6000", "path" : "/usr/lib64/liblua-5.1.so", "elfType" : 3, "buildId" : "6BDB4E1990D6EBA12A5C8D39A7650DB8798BF568" }, { "b" : "7FA24C7D7000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "B4576BE308DDCF7BC31F7304E4734C3D846D0236" }, { "b" : "7FA2475D3000", "path" : "/lib64/libcap.so.2", "elfType" : 3, "buildId" : "A436538388F1F25113FDA834CA2EED524EFA17D6" }, { "b" : "7FA248BCB000", "path" : "/lib64/libacl.so.1", "elfType" : 3, "buildId" : "26CC708AC7C0FC1797A2340C024F0ADD0CE054D8" }, { "b" : "7FA248C56000", "path" : "/lib64/libdb-4.7.so", "elfType" : 3, "buildId" : "D91C702275E2039E98E39925B02FF5C53A6C3312" }, { "b" : "7FA24AE51000", "path" : "/lib64/libattr.so.1", "elfType" : 3, "buildId" : "8EF0683858704EF173AB11B1E27076F37F82B7B6" } ] }}
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7fdf369e44d1]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0x2517EB5) [0x7fdf369e3eb5]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x7fdf36ad96d6]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0x260D721) [0x7fdf36ad9721]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo22TransactionParticipant25commitPreparedTransactionEPNS_16OperationContextENS_9TimestampE+0x7D4) [0x7fdf35f01d74]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0x1022324) [0x7fdf354ee324]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo12BasicCommand10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE+0x74) [0x7fdf3633c7a4]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBADD20) [0x7fdf35079d20]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBAE3A6) [0x7fdf3507a3a6]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBB08F1) [0x7fdf3507c8f1]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBB208C) [0x7fdf3507e08c]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE+0x476) [0x7fdf3507f076]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x3A) [0x7fdf3506a07a]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xD3) [0x7fdf35077763]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x97) [0x7fdf35072007]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBA99F1) [0x7fdf350759f1]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x1A2) [0x7fdf3618d492]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x150) [0x7fdf350700e0]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0xB55) [0x7fdf35073425]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x35F) [0x7fdf3507163f]
[ShardedClusterFixture:job4:shard0:primary]  mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x11D) [0x7fdf3507208d]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0xBA99F1) [0x7fdf350759f1]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0x1CC1A05) [0x7fdf3618da05]
[ShardedClusterFixture:job4:shard0:primary]  mongod(+0x246DC94) [0x7fdf36939c94]
[ShardedClusterFixture:job4:shard0:primary]  libpthread.so.0(+0x7AA1) [0x7fdf3185aaa1]
[ShardedClusterFixture:job4:shard0:primary]  libc.so.6(clone+0x6D) [0x7fdf315a7bdd]
[ShardedClusterFixture:job4:shard0:primary] -----  END BACKTRACE  -----



 Comments   
Comment by Githook User [ 29/Jan/19 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-38190 killOp while committing a prepared transaction can trigger std::terminate
Branch: master
https://github.com/mongodb/mongo/commit/e990d25622d96897d78e72b362db61f2a4f9d99c

Comment by Judah Schvimer [ 17/Jan/19 ]

Assigning to renctan, since it's related to SERVER-38134.

Comment by Judah Schvimer [ 19/Nov/18 ]

This will be fixed by surrounding commitPreparedTransaction with an UninterruptibleLockGuard and holding the RSTL for the entirety of commitPreparedTransaction to prevent further state transitions. We will keep the terminate behavior to make sure that once we start to commit a prepared transaction we are guaranteed to succeed.
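A minimal sketch of the guard idea, assuming a toy context with an interrupt check: the names `GuardedOpCtx`, `ToyUninterruptibleGuard`, and `toyCommitPreparedGuarded` are hypothetical and are not MongoDB's UninterruptibleLockGuard or its API. The sketch models only the "ignore interrupts while the guard is alive" part; holding the RSTL is not modelled.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical operation context whose interrupt check can be suppressed.
struct GuardedOpCtx {
    bool killed = false;           // set by a simulated killOp
    int uninterruptibleDepth = 0;  // > 0 while any guard is alive

    void checkForInterrupt() const {
        if (killed && uninterruptibleDepth == 0) {
            throw std::runtime_error("Interrupted: operation was interrupted");
        }
    }
};

// RAII guard: interrupts are ignored for the lifetime of the guard.
class ToyUninterruptibleGuard {
public:
    explicit ToyUninterruptibleGuard(GuardedOpCtx& ctx) : _ctx(ctx) {
        ++_ctx.uninterruptibleDepth;
    }
    ~ToyUninterruptibleGuard() {
        assert(_ctx.uninterruptibleDepth > 0);
        --_ctx.uninterruptibleDepth;
    }
    ToyUninterruptibleGuard(const ToyUninterruptibleGuard&) = delete;
    ToyUninterruptibleGuard& operator=(const ToyUninterruptibleGuard&) = delete;

private:
    GuardedOpCtx& _ctx;
};

// The whole commit runs under the guard, so a concurrent kill cannot
// throw out of the middle of it.
inline bool toyCommitPreparedGuarded(GuardedOpCtx& opCtx) {
    ToyUninterruptibleGuard guard(opCtx);
    opCtx.checkForInterrupt();  // stands in for acquiring a lock
    opCtx.checkForInterrupt();  // stands in for writing the oplog entry
    return true;
}
```

Once the guard goes out of scope the kill becomes visible again, so the operation can still be interrupted after the commit has completed.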

Comment by Judah Schvimer [ 19/Nov/18 ]

I guess committing a prepared transaction could be interrupted when acquiring a lock to commit the transaction or when writing the oplog entry. We should make some of this section uninterruptible and add guards to transition the participant back into prepare if it won't complete commit.

Comment by Kaloian Manassiev [ 19/Nov/18 ]

By "this" I presume you mean "thread committing a prepared transaction getting killed", right?

This will still be possible, because the state of the session can change concurrently with the check in the kill code. Furthermore, operation contexts should be able to be interrupted for various other reasons.

Because of this, the code that commits prepared transactions should ideally be made resilient to interruptions, or at the very least be made uninterruptible so that it doesn't crash the server.

Comment by Judah Schvimer [ 19/Nov/18 ]

kaloian.manassiev, should this be possible with your refactor?
