[SERVER-36841] TTL thread can (very rarely) race with shutdown and crash the server Created: 24/Aug/18  Updated: 29/Oct/23  Resolved: 28/Apr/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: James Heppenstall
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-16031 Race condition at shutdown for KVEngines Closed
is related to SERVER-47715 Remove mongos dependency on rw_concer... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2020-04-20, Execution Team 2020-05-04
Participants:
Linked BF Score: 45

 Description   

The TTL thread has an OperationContext, but takes no locks. Therefore, it can race with shutdown of the storage engine, which normally waits for operations to finish by acquiring the global X lock. Then, TTL thread's OperationContext's RecoveryUnit destructor can accesses the storage engine and crash like so:

[ShardedClusterFixture:job9] Successfully stopped mongos.
[ShardedClusterFixture:job9] Stopping shard...
[ShardedClusterFixture:job9:shard0] Stopping mongod on port 22251 with pid 22574...
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.549+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.549+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.549+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-22251.sock
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.550+0000 I ASIO     [signalProcessingThread] Dropping all pooled connections to localhost:22250 due to ShutdownInProgress: Shuting down the connection pool
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.553+0000 W SHARDING [signalProcessingThread] error encountered while cleaning up distributed ping entry for rhel71-ppc-3.pic.build.10gen.cc:22251:1531671881:178497171568223908 :: caused by :: ShutdownInProgress: Shutdown in progress
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.556+0000 W SHARDING [signalProcessingThread] cant reload ShardRegistry  :: caused by :: CallbackCanceled: Callback canceled
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.561+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.564+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.565+0000 E INDEX    [TTLMonitor] Error processing ttl index: { v: 2, key: { lastUse: 1 }, name: "lsidTTLIndex", ns: "config.system.sessions", expireAfterSeconds: 1800 } -- InterruptedAtShutdown: interrupted at shutdown
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.565+0000 I STORAGE  [TTLMonitor] WiredTigerSizeStorer magic wrong: 11111
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.565+0000 F -        [TTLMonitor] Invariant failure _magic == MAGIC src/mongo/db/storage/wiredtiger/wiredtiger_size_storer.cpp 88
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.565+0000 F -        [TTLMonitor]
[ShardedClusterFixture:job9:shard0] 
[ShardedClusterFixture:job9:shard0] ***aborting after invariant() failure
[ShardedClusterFixture:job9:shard0] 
[ShardedClusterFixture:job9:shard0] 
[ShardedClusterFixture:job9:shard0] 2018-07-15T16:26:22.579+0000 F -        [TTLMonitor] Got signal: 6 (Aborted).
[ShardedClusterFixture:job9:shard0]  0x48d68e70 0x48d67c94 0x48d682f4 0x3fff85c60478 0x3fff84cc072c 0x46e0acf4 0x471de0d4 0x471df648 0x471b334c 0x471baf80 0x471dbf10 0x471d88b8 0x471d8938 0x48c6f430 0x473a369c 0x473a3c78 0x48c646fc 0x48eb6c24 0x3fff84e68944 0x3fff84d97640
[ShardedClusterFixture:job9:shard0] ----- BEGIN BACKTRACE -----
[ShardedClusterFixture:job9:shard0] {"backtrace":[{"b":"46400000","o":"2968E70","s":"_ZN5mongo15printStackTraceERSo"},{"b":"46400000","o":"2967C94"},{"b":"46400000","o":"29682F4"},{"b":"3FFF85C60000","o":"478","s":"__kernel_sigtramp_rt64"},{"b":"3FFF84C80000","o":"4072C","s":"abort"},{"b":"46400000","o":"A0ACF4","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"46400000","o":"DDE0D4","s":"_ZNK5mongo20WiredTigerSizeStorer11_checkMagicEv"},{"b":"46400000","o":"DDF648","s":"_ZN5mongo20WiredTigerSizeStorer9syncCacheEb"},{"b":"46400000","o":"DB334C","s":"_ZNK5mongo18WiredTigerKVEngine12syncSizeInfoEb"},{"b":"46400000","o":"DBAF80","s":"_ZNK5mongo18WiredTigerKVEngine15haveDropsQueuedEv"},{"b":"46400000","o":"DDBF10","s":"_ZN5mongo22WiredTigerSessionCache14releaseSessionEPNS_17WiredTigerSessionE"},{"b":"46400000","o":"DD88B8","s":"_ZN5mongo22WiredTigerRecoveryUnitD1Ev"},{"b":"46400000","o":"DD8938","s":"_ZN5mongo22WiredTigerRecoveryUnitD0Ev"},{"b":"46400000","o":"286F430","s":"_ZNK5mongo14ServiceContext23OperationContextDeleterclEPNS_16OperationContextE"},{"b":"46400000","o":"FA369C","s":"_ZN5mongo10TTLMonitor9doTTLPassEv"},{"b":"46400000","o":"FA3C78","s":"_ZN5mongo10TTLMonitor3runEv"},{"b":"46400000","o":"28646FC","s":"_ZN5mongo13BackgroundJob7jobBodyEv"},{"b":"46400000","o":"2AB6C24"},{"b":"3FFF84E60000","o":"8944"},{"b":"3FFF84C80000","o":"117640","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.7.9-130-gcf1cb92", "gitVersion" : "cf1cb92a0d1c7ca4156cc65e8af440f4a8678219", "compiledModules" : [ "enterprise" ], "uname" : { "sysname" : "Linux", "release" : "3.10.0-229.ael7b.ppc64le", "version" : "#1 SMP Fri Jan 30 12:03:50 EST 2015", "machine" : "ppc64le" }, "somap" : [ { "b" : "46400000", "elfType" : 3, "buildId" : "469FC351D7C70B319E47E65F0CAD5AAF79DCAD2B" }, { "b" : "3FFF85C60000", "elfType" : 3, "buildId" : "179B6FB935AA023578009C7C3C0B7F09BC50ACB9" }, { "b" : "3FFF85920000", "path" : "/lib64/libnetsnmpmibs.so.31", "elfType" : 3, "buildId" : "ED5B3DACDB1C89AAD56DCCC28E56D547B1BA79CD" }, { "b" : "3FFF858F0000", "path" : "/lib64/libsensors.so.4", "elfType" : 3, "buildId" : "8D1FADE0D56540352C0C0F4CE308D96BA3F85E70" }, { "b" : "3FFF858C0000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "57734949376D241AF73F054863FB53E1F93D222D" }, { "b" : "3FFF85820000", "path" : "/lib64/librpm.so.3", "elfType" : 3, "buildId" : "13C0FC9BB78549D8B7849AFFEF78B1C7C9795C9F" }, { "b" : "3FFF857C0000", "path" : "/lib64/librpmio.so.3", "elfType" : 3, "buildId" : "8327016318024432B934A0FC928DDCECE10312E9" }, { "b" : "3FFF85710000", "path" : "/lib64/libnetsnmpagent.so.31", "elfType" : 3, "buildId" : "7101FE545A10E550BEC342129449705A1A728804" }, { "b" : "3FFF856E0000", "path" : "/lib64/libwrap.so.0", "elfType" : 3, "buildId" : "F5F1A860B171505B7A8E2AFDEFB3325E8E66231B" }, { "b" : "3FFF85590000", "path" : "/lib64/libnetsnmp.so.31", "elfType" : 3, "buildId" : "516868CBB10374155A384239D1A189CD248A1743" }, { "b" : "3FFF854F0000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "E46B75B46C31B1E307D0C04DE48C42E91BBF6219" }, { "b" : "3FFF85270000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "AB6F9785D492FDF0F4A2AC06C7D08DA8889E0C36" }, { "b" : "3FFF851F0000", "path" : "/lib64/libldap-2.4.so.2", "elfType" : 3, "buildId" : "05EEB4AD5E86B9B75EA521B67F6BD98EC5B1896B" }, { "b" : "3FFF851B0000", "path" : "/lib64/liblber-2.4.so.2", "elfType" : 3, "buildId" : "ED908B8E82E86713B780510C94F37E8D0DA97D05" }, { "b" : "3FFF85160000", "path" : "/lib64/libsasl2.so.3", "elfType" : 3, "buildId" : "2559B3B52A1555DD3CB244ABB24C0F2A9C08B366" }, { "b" : "3FFF850E0000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "3CDBA9444A5FC8958F5749ACF83E8B51DA626FBF" }, { "b" : "3FFF85040000", "path" : "/lib64/libcurl.so.4", "elfType" : 3, "buildId" : "3DDFB615CF269A8A71319B78D464E2B762D87D9B" }, { "b" : "3FFF84F50000", "path" : "/lib64/power8/libm.so.6", "elfType" : 3, "buildId" : "0CAADB7D836B16CDD176833A0DAE38764E28006F" }, { "b" : "3FFF84F10000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "8EE9D450F181F791C2FB0C30DE3112E7BEF26E3A" }, { "b" : "3FFF84EE0000", "path" : "/lib64/power8/librt.so.1", "elfType" : 3, "buildId" : "626D07527A63941827A52A38E928E1C6807D3821" }, { "b" : "3FFF84EA0000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "4E2CAE4975DA8B74EA3059838BDD2ED50EA5EB2A" }, { "b" : "3FFF84E60000", "path" : "/lib64/power8/libpthread.so.0", "elfType" : 3, "buildId" : "4C0A14BCEE88F2C83BC28A4ADF87DD3A74AD853F" }, { "b" : "3FFF84C80000", "path" : "/lib64/power8/libc.so.6", "elfType" : 3, "buildId" : "552EC71564694CA8066C907BD54328EE0C250BCF" }, { "b" : "3FFF85C80000", "path" : "/lib64/ld64.so.2", "elfType" : 3, "buildId" : "9F622117176D7D51DC0F827E25352994B88A5489" }, { "b" : "3FFF84AA0000", "path" : "/usr/lib64/perl5/CORE/libperl.so", "elfType" : 3, "buildId" : "70D11256C4702E4FBAE6B58E876818C6A76476D5" }, { "b" : "3FFF84A60000", "path" : "/lib64/libnsl.so.1", "elfType" : 3, "buildId" : "6B2BF9B1B99F8A27C3DDEA4905D5AD5706232491" }, { "b" : "3FFF84A10000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "97AC40E63A13D4E5404A75A66C57C4F63B7A3425" }, { "b" : "3FFF849E0000", "path" : "/lib64/libutil.so.1", "elfType" : 3, "buildId" : "A06F2B58CA09E0BF147841CBB1673F8A4227E5B4" }, { "b" : "3FFF84860000", "path" : "/lib64/libnss3.so", "elfType" : 3, "buildId" : "073498A05A8D524707E80B0ADDEAA7D747107A63" }, { "b" : "3FFF84820000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "16DDE2253C5115465E86C884D5A3E6A6A9D47133" }, { "b" : "3FFF847E0000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "39B56D8C7ACB8559B96CA80488A0CC5D60E5215F" }, { "b" : "3FFF847A0000", "path" : "/lib64/libelf.so.1", "elfType" : 3, "buildId" : "AB56674789921A58C89401ABFFA80A6DF0B38114" }, { "b" : "3FFF84740000", "path" : "/lib64/liblzma.so.5", "elfType" : 3, "buildId" : "A6A50FDF1BC1473396EEA9135234D1093E5F8F3A" }, { "b" : "3FFF84710000", "path" : "/lib64/libpopt.so.0", "elfType" : 3, "buildId" : "60B618082A24CD52235B29E542B374F68EB35C9E" }, { "b" : "3FFF846C0000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "0157498135739C8EC8A44CB911D45784BEEF66F0" }, { "b" : "3FFF84690000", "path" : "/lib64/libcap.so.2", "elfType" : 3, "buildId" : "FDD9669E60A5F4BB9A0D4AAD8FEACE1729003CCB" }, { "b" : "3FFF84660000", "path" : "/lib64/libacl.so.1", "elfType" : 3, "buildId" : "D95B880186196B51D0A5A28D0A9223F2D9873BEB" }, { "b" : "3FFF84600000", "path" : "/lib64/liblua-5.1.so", "elfType" : 3, "buildId" : "1A0DEC8FB9A2DA209A417A8B36105ECFB4C7D33D" }, { "b" : "3FFF843F0000", "path" : "/lib64/libdb-5.3.so", "elfType" : 3, "buildId" : "9C3A59FEA55DE5E3F95C04CAC7A282E4EB1E0054" }, { "b" : "3FFF842D0000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "D5C531B4F180EACDA6E668517C9C91C2E386B51D" }, { "b" : "3FFF842A0000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "D3778E197923CB441680BF2B0D4BD3FA51CF4B38" }, { "b" : "3FFF84240000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "C9B3E6C2792B6244EF254D26AC44E5E841AD5CA0" }, { "b" : "3FFF841C0000", "path" : "/lib64/libssl3.so", "elfType" : 3, "buildId" : "9A22C14B6BBFBD05EC534B313D772C49BE5311BD" }, { "b" : "3FFF84170000", "path" : "/lib64/libsmime3.so", "elfType" : 3, "buildId" : "33441AEA8B19A242A0D288EEE2423547E1B2243C" }, { "b" : "3FFF84110000", "path" : "/lib64/libnssutil3.so", "elfType" : 3, "buildId" : "560A3EEBE98BEAE169D217DC2EA918E408829C30" }, { "b" : "3FFF840E0000", "path" : "/lib64/libplds4.so", "elfType" : 3, "buildId" : "6FB273CE4ABD4B2ADCDA8DAED5408688FB2E6BF2" }, { "b" : "3FFF840A0000", "path" : "/lib64/libplc4.so", "elfType" : 3, "buildId" : "2640F6F5414CCA44E1B62CC76DB68E20F7D409D7" }, { "b" : "3FFF84030000", "path" : "/lib64/libnspr4.so", "elfType" : 3, "buildId" : "675263E4FB3DCC6188EFDC3A2E48B4407A91D1A3" }, { "b" : "3FFF84000000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "0F1B10767A26EFDE3F342985BA62BF06D83845DF" }, { "b" : "3FFF83FD0000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "A1142DE6F702C5B8DA7B0213B1147C8442EF6CD0" }, { "b" : "3FFF83F70000", "path" : "/lib64/libidn.so.11", "elfType" : 3, "buildId" : "F5478CCFB5C18355814D7A0D46E732117DE3506B" }, { "b" : "3FFF83F10000", "path" : "/lib64/libssh2.so.1", "elfType" : 3, "buildId" : "B8881677E2B40446F81DEFD84B9F38389AAE2DF1" }, { "b" : "3FFF83EE0000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "F8C761364FCAFD7071CEBF52BB580EE2B064A441" }, { "b" : "3FFF83E70000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "98C137242BC1780C52C3631508B28018CCF1FD33" }, { "b" : "3FFF83E40000", "path" : "/lib64/libattr.so.1", "elfType" : 3, "buildId" : "5321621DC83CD8C8A066A012ADFF4678A6816AB7" } ] }}
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo15printStackTraceERSo+0x80) [0x48d68e70]
[ShardedClusterFixture:job9:shard0]  mongod(+0x2967C94) [0x48d67c94]
[ShardedClusterFixture:job9:shard0]  mongod(+0x29682F4) [0x48d682f4]
[ShardedClusterFixture:job9:shard0]  (__kernel_sigtramp_rt64+0x0) [0x3fff85c60478]
[ShardedClusterFixture:job9:shard0]  libc.so.6(abort+0x27C) [0x3fff84cc072c]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x1C4) [0x46e0acf4]
[ShardedClusterFixture:job9:shard0]  mongod(_ZNK5mongo20WiredTigerSizeStorer11_checkMagicEv+0x134) [0x471de0d4]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo20WiredTigerSizeStorer9syncCacheEb+0xA8) [0x471df648]
[ShardedClusterFixture:job9:shard0]  mongod(_ZNK5mongo18WiredTigerKVEngine12syncSizeInfoEb+0x2C) [0x471b334c]
[ShardedClusterFixture:job9:shard0]  mongod(_ZNK5mongo18WiredTigerKVEngine15haveDropsQueuedEv+0x100) [0x471baf80]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo22WiredTigerSessionCache14releaseSessionEPNS_17WiredTigerSessionE+0x190) [0x471dbf10]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo22WiredTigerRecoveryUnitD1Ev+0xD8) [0x471d88b8]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo22WiredTigerRecoveryUnitD0Ev+0x28) [0x471d8938]
[ShardedClusterFixture:job9:shard0]  mongod(_ZNK5mongo14ServiceContext23OperationContextDeleterclEPNS_16OperationContextE+0x190) [0x48c6f430]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo10TTLMonitor9doTTLPassEv+0x9BC) [0x473a369c]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo10TTLMonitor3runEv+0x3A8) [0x473a3c78]
[ShardedClusterFixture:job9:shard0]  mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x15C) [0x48c646fc]
[ShardedClusterFixture:job9:shard0]  mongod(+0x2AB6C24) [0x48eb6c24]
[ShardedClusterFixture:job9:shard0]  libpthread.so.0(+0x8944) [0x3fff84e68944]
[ShardedClusterFixture:job9:shard0]  libc.so.6(clone+0x98) [0x3fff84d97640]
[ShardedClusterFixture:job9:shard0] -----  END BACKTRACE  -----



 Comments   
Comment by Githook User [ 28/Apr/20 ]

Author:

{'name': 'Jamie Heppenstall', 'email': 'jamie.heppenstall@mongodb.com', 'username': 'JamesHeppenstall'}

Message: SERVER-36841 Fix race between TTL thread and shutdown
Branch: master
https://github.com/mongodb/mongo/commit/f0cf2ca62b4957587e5fe3a897b35b9a54873e23

Comment by Dianna Hohensee (Inactive) [ 02/Jul/19 ]

It was recently concluded that we wish to delete the BackgroundJob interface that the TTLMonitor class implements, so this ticket should move the TTLMonitor off of that and implement the usual shutdown() logic.

Comment by Mira Carey [ 04/Sep/18 ]

While this could be solved by global clean shutdown, I don't think we intend to do that any time soon.

As such, this is a status quo race between shutdown and the ttl thread, which should involve modifying the ttl thread to fix test failures.

Bouncing back to storage

Comment by Ian Whalen (Inactive) [ 30/Aug/18 ]

Based on a convo with Greg, it sounds like Platform didn't intend to do this any time soon so it was moved to Query under the "it was going to get closed otherwise" rubric. As such, it sounds like this should still live with Platform. And because this is related to fixing a test failure, if this is closed please be sure to link a new SERVER ticket to that test failure.

Comment by Ian Whalen (Inactive) [ 30/Aug/18 ]

sara.williamson just a reminder during triage meetings to always please add a comment when moving to a new team so they know why it was moved over.

Comment by Eric Milkie [ 24/Aug/18 ]

I believe the way to fix this would be to remove the polling of the shutdown flag that the TTL loop does, replaced with a better mechanism that will be produced by the Global clean shutdown project.

Generated at Thu Feb 08 04:44:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.