[SERVER-24283] Invariant failure grantedCounts[mode] >= 1 Created: 25/May/16  Updated: 10/May/23  Resolved: 15/Jun/16

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 3.2.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Le Lay Assignee: Kaloian Manassiev
Resolution: Cannot Reproduce Votes: 0
Labels: mmapv1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-26578 Add startup warning for Intel CPUs wh... Closed
Operating System: ALL
Steps To Reproduce:
  • mongod is running on three boxes.
  • master: box2
  • slaves (forced priority 0): box4 and box5
  • mongod crashed on box2, which was not under load at all
  • errors immediately appeared on box4 and box5
  • no errors after reboot
Sprint: Sharding 15 (06/03/16), Sharding 16 (06/24/16)
Participants:
Linked BF Score: 5

 Description   

MongoDB, running in a three-machine cluster, crashed with an invariant failure on the primary.
I have full debug-level logs of the event but cannot reproduce it.
I rebooted the three machines the next morning to bring them back to normal.

# mongod -version
db version v3.2.6
git version: 05552b562c7a0b3143a729aaa0838e558dc49b25
OpenSSL version: OpenSSL 1.0.2d 9 Jul 2015
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1404
    distarch: x86_64
    target_arch: x86_64

LOG EXCERPT ON THE PRIMARY

2016-05-18T20:48:18.627+0200 I COMMAND  [conn813] getmore local.oplog.rs query: { ns: { $regex: "^blah\." }, $or: [ { op: { $in: [ "i", "u", "d" ] } }, { op: "c", o.drop: { $exists: true } }, { op: "c", o.dropDatabase: 1 } ], ts: { $gt: Timestamp 1463580273000|2 } } cursorid:26977319316 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 6 } }, MMAPV1Journal: { acquireCount: { r: 3 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { R: 3 } } } 1000ms
2016-05-18T20:48:18.627+0200 I -        [conn511] Invariant failure grantedCounts[mode] >= 1 src/mongo/db/concurrency/lock_manager.cpp 251
2016-05-18T20:48:18.627+0200 I COMMAND  [conn661] getmore local.oplog.rs query: { ns: { $regex: "^blah\." }, $or: [ { op: { $in: [ "i", "u", "d" ] } }, { op: "c", o.drop: { $exists: true } }, { op: "c", o.dropDatabase: 1 } ], ts: { $gt: Timestamp 1463580273000|1 } } cursorid:27932375020 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 6 } }, MMAPV1Journal: { acquireCount: { r: 3 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { R: 3 } } } 1000ms
2016-05-18T20:48:18.627+0200 I -        [conn511]
 
***aborting after invariant() failure
 
 
2016-05-18T20:48:18.639+0200 F -        [conn511] Got signal: 6 (Aborted).
 
 0x1315982 0x1314ad9 0x13152e2 0x7f295deb4d10 0x7f295db0f267 0x7f295db10eca 0x129f8fb 0xbb046d 0xbb7206 0xbbac1c 0xbc4c85 0xde9f76 0xccee49 0xcd5115 0x9b78cc 0x12c2e7d 0x7f295deab6aa 0x7f295dbe0e9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F15982","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F14AD9"},{"b":"400000","o":"F152E2"},{"b":"7F295DEA4000","o":"10D10"},{"b":"7F295DADA000","o":"35267","s":"gsignal"},{"b":"7F295DADA000","o":"36ECA","s":"abort"},{"b":"400000","o":"E9F8FB","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"7B046D","s":"_ZN5mongo11LockManager6unlockEPNS_11LockRequestE"},{"b":"400000","o":"7B7206","s":"_ZN5mongo10LockerImplILb1EE11_unlockImplEPNS_14FastMapNoAllocINS_10ResourceIdENS_11LockRequestELi16EE12IteratorImplIS5_S4_EE"},{"b":"400000","o":"7BAC1C","s":"_ZN5mongo10LockerImplILb1EE6unlockENS_10ResourceIdE"},{"b":"400000","o":"7C4C85","s":"_ZN5mongo24AutoGetCollectionForReadD1Ev"},{"b":"400000","o":"9E9F76","s":"_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_"},{"b":"400000","o":"8CEE49","s":"_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE"},{"b":"400000","o":"8D5115","s":"_ZN5mongo16assembleR esponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"400000","o":"5B78CC","s":"_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"EC2E7D","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F295DEA4000","o":"76AA"},{"b":"7F295DADA000","o":"106E9D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.6", "gitVersion" : "05552b562c7a0b3143a729aaa0838e558dc49b25", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.2.0-36-generic", "version" : "#41-Ubuntu SMP Mon Apr 18 15:49:10 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "B35B0D2B9CB934315BA086AB5C88C919B9B5BA50" }, { "b" : "7FFF7BBF1000", "elfType" : 3, "buildId" : "9797E1859293A4899C7420402AC699C96923910C" }, { "b" : "7F295EE31000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "96D927A52B6A405C147AC4D3F8A6F14CC31316BA" }, { "b" : "7F295E9ED000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "039AD0290D6DDCD62FFAAFF6D241FD313938E654" }, { "b" : "7F295E7E5000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "7ED450BD936778B73D53F9E69AD2E3E0BD614277" }, { "b" : "7F295E5E1000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "A20BCC32D4C15FAB47EEC600027EF571699A03FD" }, { "b" : "7F295E2D9000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "42FC4D8B6B70D1A8469B3A1A0164A2559F4E17C3" }, { "b" : "7F295E0C2000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "0C3C07EE15CFA81346847A679E8444B876D9CC58" }, { "b" : "7F295DEA4000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "E68326F6C5F4C01A31EE635E6563BE58749F3AB4" }, { "b" : "7F295DADA000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "59EAE8A903584DCBC14CF07C719CD23B9F65E230" }, { "b" : "7F295F09A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "CC2853E4A748C809FA82EA774FBACE3AD67DAFED" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1315982]
 mongod(+0xF14AD9) [0x1314ad9]
 mongod(+0xF152E2) [0x13152e2]
 libpthread.so.0(+0x10D10) [0x7f295deb4d10]
 libc.so.6(gsignal+0x37) [0x7f295db0f267]
 libc.so.6(abort+0x16A) [0x7f295db10eca]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x129f8fb]
 mongod(_ZN5mongo11LockManager6unlockEPNS_11LockRequestE+0x44D) [0xbb046d]
 mongod(_ZN5mongo10LockerImplILb1EE11_unlockImplEPNS_14FastMapNoAllocINS_10ResourceIdENS_11LockRequestELi16EE12IteratorImplIS5_S4_EE+0x36) [0xbb7206]
 mongod(_ZN5mongo10LockerImplILb1EE6unlockENS_10ResourceIdE+0x10C) [0xbbac1c]
 mongod(_ZN5mongo24AutoGetCollectionForReadD1Ev+0xB5) [0xbc4c85]
 mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_+0x866) [0xde9f76]
 mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1A9) [0xccee49]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xE35) [0xcd5115]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0xEC) [0x9b78cc]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x26D) [0x12c2e7d]
 libpthread.so.0(+0x76AA) [0x7f295deab6aa]
 libc.so.6(clone+0x6D) [0x7f295dbe0e9d]
-----  END BACKTRACE  -----



 Comments   
Comment by Kaloian Manassiev [ 11/Oct/16 ]

Hi ericeliga,

Another customer recently discovered a similar lock manager problem, which turned out to be due to bugs in Intel's TSX microcode (SERVER-26018). We have filed SERVER-26578 to issue a startup warning when this particular situation is detected; that ticket contains more information.

Are you still experiencing this invariant failure, and if so, would it be possible to report the chipset on which the machine is running?

Thank you in advance.

Best regards,
-Kal.

Comment by Ramon Fernandez Marina [ 15/Jun/16 ]

ericeliga, the additional checks mentioned above are described in SERVER-24316.

Comment by Kaloian Manassiev [ 15/Jun/16 ]

Hi,

Apologies for the late reply.

We have been unable to reproduce this issue on our side, and the information in the call stacks is insufficient to determine the root cause. As part of the investigation we inspected the code and found no obvious bugs. In addition, we checked in some extra assertions and tests to help us catch this at an earlier stage should it happen again, and we will back-port them to 3.2.

For now, though, there is nothing more we can do, so I am going to close this ticket as 'Cannot Reproduce'. We can revisit it if the failure happens again with the additional checks in place and gives us more information.

Apologies again for the inconvenience and thank you very much for providing us with debugging information.

Best regards,
-Kal.

Comment by Eric Le Lay [ 15/Jun/16 ]

Hi, are you waiting for more info from my side?
I'll have to repurpose those boxes eventually...

Comment by Kaloian Manassiev [ 31/May/16 ]

This node is running MMAPv1, and from the getmores that happened just before the invariant was hit, it can be seen that the oplog collection lock is being acquired in MODE_S (the last entry in the log line below):

2016-05-18T20:48:18.627+0200 I COMMAND  [conn813] getmore local.oplog.rs 
    query: { ns: { $regex: "^blah\." }, $or: [ { op: { $in: [ "i", "u", "d" ] } }, { op: "c", o.drop: { $exists: true } }, { op: "c", o.dropDatabase: 1 } ], ts: { $gt: Timestamp 1463580273000|2 } }
    cursorid:26977319316 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:0 reslen:20 
    locks:{ Global: { acquireCount: { r: 6 } }, MMAPV1Journal: { acquireCount: { r: 3 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { R: 3 } } } 1000ms

Also, from the format of the log line, it looks like the caller is using legacy wire-protocol op codes rather than the new getMore command.

This is an example of how to exercise this code path:

./mongo --port 20000 --readMode legacy
...
var oplogDb = db.getSisterDB('local');
var cursorOptions = { tailable: true, awaitdata: true, numberOfRetries: -1 };
oplogDb.oplog.rs.find({}, cursorOptions).sort({$natural: -1}).forEach(function(entry) { });

The oplog collection lock is acquired in MODE_S because this is an MMAPv1 node.
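
To make the log notation above easier to read: the single-letter codes in the acquireCount maps correspond to the lock manager's modes ('r' = MODE_IS, 'w' = MODE_IX, 'R' = MODE_S, 'W' = MODE_X), so oplog: { acquireCount: { R: 3 } } means three MODE_S acquisitions of the oplog collection lock. The following is a minimal standalone sketch of that mapping, purely for illustration; the table and program below are not the server's actual definitions:

#include <iostream>
#include <map>
#include <string>

// Illustrative mapping of the single-letter lock mode codes that appear in
// mongod log lines and serverStatus output to their mode names. This is not
// the server's actual LockMode enum; it only documents the notation.
static const std::map<char, std::string> kModeCodes = {
    {'r', "MODE_IS (intent shared)"},
    {'w', "MODE_IX (intent exclusive)"},
    {'R', "MODE_S (shared)"},
    {'W', "MODE_X (exclusive)"},
};

int main() {
    // The excerpt above reports: oplog: { acquireCount: { R: 3 } }
    // i.e. the oplog collection lock was acquired three times in MODE_S.
    std::cout << "'R' => " << kModeCodes.at('R') << std::endl;
    return 0;
}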

Comment by Eric Le Lay [ 26/May/16 ]

I uploaded mongod_clean_1.log.bz2.
I should mention that we use MongoDB for Meteor (meteor.com) applications running on box2, box4 and box5. They all observe the oplog on box2.

Comment by Ramon Fernandez Marina [ 26/May/16 ]

Hi ericeliga, you can use this upload portal to send us files for this ticket securely and privately.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 25/May/16 ]

ericeliga, can you please upload the full logs from the affected node from the last restart until the time of the invariant failure?

Thanks,
Ramón.

Comment by Andy Schwerin [ 25/May/16 ]

My read of this is that we're either unlocking in a different mode than the one we locked, or that somebody double-unlocked.
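
To make those two failure scenarios concrete, here is a minimal self-contained sketch of per-resource granted-count bookkeeping; it is an illustration only, not MongoDB's actual LockManager (the type and member names below are invented). Each grant bumps a counter for its mode and each unlock decrements it, so releasing a mode that was never granted, or releasing the same grant twice, drives the counter below one and trips the same kind of grantedCounts[mode] >= 1 check seen at lock_manager.cpp:251:

#include <array>
#include <cassert>
#include <cstdio>

// Toy model of a lock head's granted-mode bookkeeping (illustration only).
enum LockMode { MODE_IS, MODE_IX, MODE_S, MODE_X, LockModesCount };

struct ToyLockHead {
    std::array<int, LockModesCount> grantedCounts{};  // all counters start at zero

    void grant(LockMode mode) {
        ++grantedCounts[mode];
    }

    void unlock(LockMode mode) {
        // Analogous to the failed invariant: the mode being released must
        // currently have at least one granted holder.
        assert(grantedCounts[mode] >= 1);
        --grantedCounts[mode];
    }
};

int main() {
    ToyLockHead oplogLock;

    oplogLock.grant(MODE_S);
    oplogLock.unlock(MODE_S);      // fine: counter goes 1 -> 0

    // Scenario 1: unlock in a different mode than was locked.
    oplogLock.grant(MODE_S);
    // oplogLock.unlock(MODE_IS);  // would trip the assert: MODE_IS was never granted

    // Scenario 2: double unlock of the same grant.
    oplogLock.unlock(MODE_S);
    // oplogLock.unlock(MODE_S);   // second release would also trip the assert

    std::puts("bookkeeping consistent");
    return 0;
}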

Comment by Ramon Fernandez Marina [ 25/May/16 ]

Thanks for your report, ericeliga. We're taking a look at the stack trace and will post updates in this ticket when we have them.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 25/May/16 ]

$ addr2line -ipsf -C cpp -e mongodb-linux-x86_64-ubuntu1404-3.2.6/mongod.debug `cat bt`
?? ??:0
mongo::printStackTrace(std::ostream&) at stacktrace_posix.cpp:172
printSignalAndBacktrace at signal_handlers_synchronous.cpp:182
abruptQuit at signal_handlers_synchronous.cpp:238
?? ??:0
?? ??:0
?? ??:0
mongo::invariantFailed(char const*, char const*, unsigned int) at assert_util.cpp:154 (discriminator 1)
mongo::LockHead::decGrantedModeCount(mongo::LockMode) at lock_manager.cpp:251
 (inlined by) mongo::LockManager::unlock(mongo::LockRequest*) at lock_manager.cpp:628
mongo::LockerImpl<true>::_unlockImpl(mongo::FastMapNoAlloc<mongo::ResourceId, mongo::LockRequest, 16>::IteratorImpl<mongo::FastMapNoAlloc<mongo::ResourceId, mongo::LockRequest, 16>, mongo::LockRequest>*) at lock_state.cpp:787
mongo::LockerImpl<true>::unlock(mongo::ResourceId) at lock_state.cpp:455
~AutoGetDb at db_raii.h:51
 (inlined by) ~AutoGetCollection at db_raii.h:73
 (inlined by) boost::optional_detail::optional_base<mongo::AutoGetCollection>::destroy_impl(mpl_::bool_<false>) at optional.hpp:729
 (inlined by) boost::optional_detail::optional_base<mongo::AutoGetCollection>::destroy() at optional.hpp:691
 (inlined by) ~optional_base at optional.hpp:334
 (inlined by) ~optional at optional.hpp:861
 (inlined by) mongo::AutoGetCollectionForRead::~AutoGetCollectionForRead() at db_raii.cpp:101
std::default_delete<mongo::AutoGetCollectionForRead>::operator()(mongo::AutoGetCollectionForRead*) const at unique_ptr.h:67
 (inlined by) ~unique_ptr at unique_ptr.h:184
 (inlined by) mongo::getMore(mongo::OperationContext*, char const*, int, long long, bool*, bool*) at find.cpp:284
~Status at status-inl.h:64
 (inlined by) mongo::receivedGetMore(mongo::OperationContext*, mongo::DbResponse&, mongo::Message&, mongo::CurOp&) at instance.cpp:902
mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) at instance.cpp:528
~basic_string at basic_string.h:539
 (inlined by) ~HostAndPort at hostandport.h:49
 (inlined by) mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*) at db.cpp:177
mongo::PortMessageServer::handleIncomingMsg(void*) at message_server_port.cpp:231
