[SERVER-16763] mongod terminate due to mongo::DBTryLockTimeoutException during longevity test with wiredTiger Created: 07/Jan/15  Updated: 23/Jan/15  Resolved: 15/Jan/15

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.8.0-rc3, 2.8.0-rc4
Fix Version/s: 3.0.0-rc6

Type: Bug Priority: Critical - P2
Reporter: Rui Zhang (Inactive) Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-4740 Use monotonic clock sources for Timer Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

mongod crashed during mixed read/write traffic testing, the thread raise the exception is shard related, which does moveChunk.

This happens after about 3 days of execution, reproduced the same issue with rc3 & rc4.

here is log about the crash (from rc4)

2015-01-05T22:54:49.372+0000 F -        [conn60] terminate() called. An exception is active; attempting to gather more information
2015-01-05T22:54:49.441+0000 F -        [conn60] std::exception::what(): std::exception
Actual exception type: mongo::DBTryLockTimeoutException
 
 0xf133b9 0xf12eb0 0x7fb3ac8bb6c6 0x7fb3ac8ba789 0x7fb3ac8bb33a 0x7fb3ac358913 0x7fb3ac358e47 0x9954a4 0xdac52c 0xdae3f0 0x9ad054 0x9adf93 0x9aea4b 0xb7ca1a 0xa8fcd5 0x7e41f0 0xed1381 0x7fb3acf74f18 0x7fb3ac086b9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B133B9"},{"b":"400000","o":"B12EB0"},{"b":"7FB3AC85D000","o":"5E6C6"},{"b":"7FB3AC85D000","o":"5D789"},{"b":"7FB3AC85D000","o":"5E33A"},{"b":"7FB3AC349000","o":"F913"},{"b":"7FB3AC349000","o":"FE47"},{"b":"400000","o":"5954A4"},{"b":"400000","o":"9AC52C"},{"b":"400000","o":"9AE3F0"},{"b":"400000","o":"5AD054"},{"b":"400000","o":"5ADF93"},{"b":"400000","o":"5AEA4B"},{"b":"400000","o":"77CA1A"},{"b":"400000","o":"68FCD5"},{"b":"400000","o":"3E41F0"},{"b":"400000","o":"AD1381"},{"b":"7FB3ACF6D000","o":"7F18"},{"b":"7FB3ABFA4000","o":"E2B9D"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc4", "gitVersion" : "3ad571742911f04b307f0071979425511c4f2570", "uname" : { "sysname" : "Linux", "release" : "3.14.19-17.43.amzn1.x86_64", "version" : "#1 SMP Wed Sep 17 22:14:52 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFA4AFE000", "elfType" : 3 }, { "b" : "7FB3ACF6D000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7FB3ACD65000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7FB3ACB61000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7FB3AC85D000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7FB3AC55F000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7FB3AC349000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7FB3ABFA4000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7FB3AD189000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf133b9]
 mongod(+0xB12EB0) [0xf12eb0]
 libstdc++.so.6(+0x5E6C6) [0x7fb3ac8bb6c6]
 libstdc++.so.6(+0x5D789) [0x7fb3ac8ba789]
 libstdc++.so.6(__gxx_personality_v0+0x52A) [0x7fb3ac8bb33a]
 libgcc_s.so.1(+0xF913) [0x7fb3ac358913]
 libgcc_s.so.1(_Unwind_Resume+0x57) [0x7fb3ac358e47]
 mongod(_ZN5mongo4Lock10GlobalReadC2EPNS_6LockerEj+0x84) [0x9954a4]
 mongod(_ZN5mongo17MigrateFromStatus4doneEPNS_16OperationContextE+0x8C) [0xdac52c]
 mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CE0) [0xdae3f0]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9ad054]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9adf93]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9aea4b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb7ca1a]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8fcd5]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e41f0]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xed1381]
 libpthread.so.0(+0x7F18) [0x7fb3acf74f18]
 libc.so.6(clone+0x6D) [0x7fb3ac086b9d]
-----  END BACKTRACE  -----

few more events related to conn60 before the crash

2015-01-05T17:47:22.452+0000 I SHARDING [conn60] moveChunk data transfer progress: { active: true, ns: "sbtest.sbtest1", from: "rs2/172.31.32.214:27017,ip-172-31-35-229:27017", min: { _id: -7816322693657637576 }, max: { _id: -7672769179660119751 }, shardKeyPattern: { _id: "hashed" }, state: "clone", counts: { cloned: 1480, clonedBytes: 321160, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 6
 
2015-01-05T17:47:22.602+0000 I SHARDING [conn60] About to check if it is safe to enter critical section
2015-01-05T17:47:22.602+0000 E SHARDING [conn60] moveChunk cannot enter critical section before all data is cloned, 81584 locs were not transferred but to-shard reported { active: true, ns: "sbtest.sbtest1", from: "rs2/172.31.32.214:27017,ip-172-31-35-229:27017", min: { _id: -7816322693657637576 }, max: { _id: -7672769179660119751 }, shardKeyPattern: { _id: "hashed" }, state: "clone", counts: { cloned: 1480, clonedBytes: 321160, catchup: 0, steady: 0 }, ok: 1.0 }
2015-01-05T17:47:22.602+0000 I SHARDING [conn60] MigrateFromStatus::done About to acquire global lock to exit critical section

the setup is

  • 3 config server
  • 1 mongos
  • 3 shards, each with two member replication set
  • wiredTiger
  • all options default


 Comments   
Comment by Kaloian Manassiev [ 15/Jan/15 ]

A fix was pushed so that when no timeout is requested, lock acquisitions won't depend on the clock, just wait forever (updating the wait statistics in between waits). This will take care of jumping clocks.

For the timeout cases, it is valid for the wait call to return earlier, since the code expects it.

Comment by Githook User [ 15/Jan/15 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-16763 Wait infinitely when no timeout has been requested
Branch: master
https://github.com/mongodb/mongo/commit/fb734fd9183e2929cc48d51988110c76d0d250a7

Comment by Eric Milkie [ 07/Jan/15 ]

Actually, that technique might not work since we added true yielding for WiredTiger.

Comment by Eric Milkie [ 07/Jan/15 ]

Rui, I would expect you could get the server to crash by using WiredTiger and building a foreground index that takes longer than 71 minutes, and at the same time ran a command that needs a global lock such as the "sleep" test command.

Comment by Eric Milkie [ 07/Jan/15 ]

Since we pass in UINT_MAX milliseconds for grabbing a Global lock, that translates to about 71 minutes before it gives up. This is a problem, since it's possible that things like index builds could take longer than 71 minutes and thus could throw a lock timeout exception unnecessarily in a thread waiting for a global lock.

Comment by Rui Zhang (Inactive) [ 07/Jan/15 ]

with rc3

2014-12-23T03:53:51.120+0000 I -        [conn69] LockerId 94 has been waiting to acquire lock for more than 30 seconds. MongoDB will print the lock manager state and the stack of the thread that has been waiting, for diagnostic purposes. This message does not necessary imply that the server is experiencing an outage, but might be an indication of an overload.
2014-12-23T03:53:51.120+0000 I -        [conn69] Dumping LockManager @ 0x1ae7480
2014-12-23T03:53:51.120+0000 I -        [conn69] Lock @ 0x4020700: {2305843009213693953: Global, 1}
GRANTED:
        LockRequest 2790 @ 0x43BA5F100: Mode = IX; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0;
 
PENDING:
        LockRequest 94 @ 0x10FABC00: Mode = S; ConvertMode = NONE; EnqueueAtFront = 1; CompatibleFirst = 1;
2014-12-23T03:53:51.120+0000 I -        [conn69]
2014-12-23T03:53:51.126+0000 I -        [conn69]
 0xf0bd99 0x99966a 0x9a13cc 0x99a84b 0x9917fd 0xda565c 0xda751a 0x9a9784 0x9aa5d3 0x9ab08b 0xb77c2a 0xa8af55 0x7e1770 0xec9d61 0x7f3424bd1f18 0x7f3423ce3b9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B0BD99"},{"b":"400000","o":"59966A"},{"b":"400000","o":"5A13CC"},{"b":"400000","o":"59A84B"},{"b":"400000","o":"5917FD"},{"b":"400000","o":"9A565C"},{"b":"400000","o":"9A751A"},{"b":"400000","o":"5A9784"},{"b":"400000","o":"5AA5D3"},{"b":"400000","o":"5AB08B"},{"b":"400000","o":"777C2A"},{"b":"400000","o":"68AF55"},{"b":"400000","o":"3E1770"},{"b":"400000","o":"AC9D61"},{"b":"7F3424BCA000","o":"7F18"},{"b":"7F3423C01000","o":"E2B9D"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc3", "gitVersion" : "2d679247f17dab05a492c8b6d2c250dab18e54f2", "uname" : { "sysname" : "Linux", "release" : "3.10.42-52.145.amzn1.x86_64", "version" : "#1 SMP Tue Jun 10 23:46:43 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFE7AFE000", "elfType" : 3 }, { "b" : "7F3424BCA000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F34249C2000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F34247BE000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F34244BA000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F34241BC000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F3423FA6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F3423C01000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F3424DE6000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf0bd99]
 mongod(+0x59966A) [0x99966a]
 mongod(_ZN5mongo10LockerImplILb0EE12lockCompleteENS_10ResourceIdEjb+0x18C) [0x9a13cc]
 mongod(_ZN5mongo10LockerImplILb0EE10lockGlobalENS_8LockModeEj+0x2B) [0x99a84b]
 mongod(_ZN5mongo4Lock10GlobalReadC2EPNS_6LockerEj+0x2D) [0x9917fd]
 mongod(_ZN5mongo17MigrateFromStatus4doneEPNS_16OperationContextE+0x8C) [0xda565c]
 mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CDA) [0xda751a]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9a9784]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9aa5d3]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9ab08b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb77c2a]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8af55]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e1770]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xec9d61]
 libpthread.so.0(+0x7F18) [0x7f3424bd1f18]
 libc.so.6(clone+0x6D) [0x7f3423ce3b9d]
-----  END BACKTRACE  -----
2014-12-23T03:53:51.530+0000 F -        [conn69] terminate() called. An exception is active; attempting to gather more information
2014-12-23T03:53:51.557+0000 F -        [conn69] std::exception::what(): std::exception
Actual exception type: mongo::DBTryLockTimeoutException
 
 0xf0bd99 0xf0b890 0x7f34245186c6 0x7f3424517789 0x7f342451833a 0x7f3423fb5913 0x7f3423fb5e47 0x991854 0xda565c 0xda751a 0x9a9784 0x9aa5d3 0x9ab08b 0xb77c2a 0xa8af55 0x7e1770 0xec9d61 0x7f3424bd1f18 0x7f3423ce3b9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B0BD99"},{"b":"400000","o":"B0B890"},{"b":"7F34244BA000","o":"5E6C6"},{"b":"7F34244BA000","o":"5D789"},{"b":"7F34244BA000","o":"5E33A"},{"b":"7F3423FA6000","o":"F913"},{"b":"7F3423FA6000","o":"FE47"},{"b":"400000","o":"591854"},{"b":"400000","o":"9A565C"},{"b":"400000","o":"9A751A"},{"b":"400000","o":"5A9784"},{"b":"400000","o":"5AA5D3"},{"b":"400000","o":"5AB08B"},{"b":"400000","o":"777C2A"},{"b":"400000","o":"68AF55"},{"b":"400000","o":"3E1770"},{"b":"400000","o":"AC9D61"},{"b":"7F3424BCA000","o":"7F18"},{"b":"7F3423C01000","o":"E2B9D"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc3", "gitVersion" : "2d679247f17dab05a492c8b6d2c250dab18e54f2", "uname" : { "sysname" : "Linux", "release" : "3.10.42-52.145.amzn1.x86_64", "version" : "#1 SMP Tue Jun 10 23:46:43 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFE7AFE000", "elfType" : 3 }, { "b" : "7F3424BCA000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F34249C2000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F34247BE000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F34244BA000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F34241BC000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F3423FA6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F3423C01000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F3424DE6000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf0bd99]
 mongod(+0xB0B890) [0xf0b890]
 libstdc++.so.6(+0x5E6C6) [0x7f34245186c6]
 libstdc++.so.6(+0x5D789) [0x7f3424517789]
 libstdc++.so.6(__gxx_personality_v0+0x52A) [0x7f342451833a]
 libgcc_s.so.1(+0xF913) [0x7f3423fb5913]
 libgcc_s.so.1(_Unwind_Resume+0x57) [0x7f3423fb5e47]
 mongod(_ZN5mongo4Lock10GlobalReadC2EPNS_6LockerEj+0x84) [0x991854]
 mongod(_ZN5mongo17MigrateFromStatus4doneEPNS_16OperationContextE+0x8C) [0xda565c]
 mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CDA) [0xda751a]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9a9784]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9aa5d3]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9ab08b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb77c2a]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8af55]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e1770]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xec9d61]
 libpthread.so.0(+0x7F18) [0x7f3424bd1f18]
 libc.so.6(clone+0x6D) [0x7f3423ce3b9d]
-----  END BACKTRACE  -----

Generated at Thu Feb 08 03:42:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.