[SERVER-55379] Invariant failure _requests.empty() at src/mongo/db/concurrency/lock_state.cpp 289 Created: 19/Mar/21  Updated: 06/Dec/22  Resolved: 31/Mar/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kelsey Schubert Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates SERVER-60685 TransactionCoordinator may interrupt ... Closed
Problem/Incident
causes SERVER-57360 Log additional debug info for the "in... Closed
Related
is related to SERVER-56251 Alleviate problems that arise when Op... Backlog
Assigned Teams:
Storage Execution
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2
Sprint: Execution Team 2021-05-03, Execution Team 2021-05-17, Execution Team 2021-05-31, Execution Team 2021-06-14, Execution Team 2021-06-28, Execution Team 2021-07-12, Execution Team 2021-07-26
Participants:
Case:
Linked BF Score: 150

 Description   

Observed on MongoDB 4.2.12:

 

2021-03-17T13:45:33.592+0000 F - [TransactionCoordinator] Invariant failure _requests.empty() src/mongo/db/concurrency/lock_state.cpp 289
2021-03-17T13:45:33.592+0000 F - [TransactionCoordinator] \n\n***aborting after invariant() failure\n\n
2021-03-17T13:45:33.625+0000 F - [TransactionCoordinator] Got signal: 6 (Aborted).
 0x559a85513121 0x559a8551274c 0x559a855127d6 0x7f95887a8630 0x7f9588401387 0x7f9588402a78 0x559a8384a9bb 0x559a83821290 0x559a853d8ec1 0x559a8540cbe7 0x559a8540ccc1 0x559a8540f88c 0x559a83fbd151 0x559a8363e156 0x559a84bd2bd3 0x559a84bd30f2 0x559a842f7bda 0x559a842f90a1 0x559a842fa1a9 0x559a842faa72 0x559a8563a8bf 0x7f95887a0ea5 0x7f95884c996d
----- BEGIN BACKTRACE -----
{"backtrace":[\{"b":"559A82ADE000","o":"2A35121","s":"_ZN5mongo15printStackTraceERSo"},\{"b":"559A82ADE000","o":"2A3474C"},\{"b":"559A82ADE000","o":"2A347D6"},\{"b":"7F9588799000","o":"F630"},\{"b":"7F95883CB000","o":"36387","s":"gsignal"},\{"b":"7F95883CB000","o":"37A78","s":"abort"},\{"b":"559A82ADE000","o":"D6C9BB","s":"_ZN5mongo12verifyFailedEPKcS1_j"},\{"b":"559A82ADE000","o":"D43290"},\{"b":"559A82ADE000","o":"28FAEC1","s":"_ZN5mongo10LockerImplD0Ev"},\{"b":"559A82ADE000","o":"292EBE7","s":"_ZN5mongo16OperationContextD1Ev"},\{"b":"559A82ADE000","o":"292ECC1","s":"_ZN5mongo16OperationContextD0Ev"},\{"b":"559A82ADE000","o":"293188C","s":"_ZNK5mongo14ServiceContext23OperationContextDeleterclEPNS_16OperationContextE"},\{"b":"559A82ADE000","o":"14DF151"},\{"b":"559A82ADE000","o":"B60156"},\{"b":"559A82ADE000","o":"20F4BD3","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE"},\{"b":"559A82ADE000","o":"20F50F2"},\{"b":"559A82ADE000","o":"1819BDA","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE"},\{"b":"559A82ADE000","o":"181B0A1","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},\{"b":"559A82ADE000","o":"181C1A9","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},\{"b":"559A82ADE000","o":"181CA72"},\{"b":"559A82ADE000","o":"2B5C8BF"},\{"b":"7F9588799000","o":"7EA5"},\{"b":"7F95883CB000","o":"FE96D","s":"clone"}],"processInfo":\{ "mongodbVersion" : "4.2.12", "gitVersion" : "5593fd8e33b60c75802edab304e23998fa0ce8a5", "compiledModules" : [ "enterprise" ], "uname" : { "sysname" : "Linux", "release" : "3.10.0-1160.11.1.el7.x86_64", "version" : "#1 SMP Fri Dec 18 16:34:56 UTC 2020", "machine" : "x86_64" }, "somap" : [ \{ "b" : "559A82ADE000", "elfType" : 3, "buildId" : "8787C8D20F8DCA892BA0E9F83F9BF2E33D517D8D" }, \{ "b" : "7FFF6345F000", "elfType" : 3, "buildId" : "89E25679A85219B08C3FF58CE11C35336C9BBAD7" }, \{ "b" : "7F958B785000", "path" : "/usr/lib64/libldap_r/libldap-2.4.so.2", "elfType" : 3, "buildId" : "E17DAD36A8A8D068135B66CFF68E2E55C0B7ECB9" }, \{ "b" : "7F958B576000", "path" : "/lib64/liblber-2.4.so.2", "elfType" : 3, "buildId" : "8804516A3226CFB54589FEC0E27D89C93DAF92FF" }, \{ "b" : "7F958B0BD000", "path" : "/lib64/libnetsnmpmibs.so.31", "elfType" : 3, "buildId" : "5BE1D0A9720876D669FDBAFF487E4F8CE453069C" }, \{ "b" : "7F958AEAE000", "path" : "/lib64/libsensors.so.4", "elfType" : 3, "buildId" : "A2ACE3E193F25778AA87D2E221945FDCCFCF220F" }, \{ "b" : "7F958ACAA000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F2C36986E11A291A0D4BCB3A81632B24AE2359EA" }, \{ "b" : "7F958AA42000", "path" : "/lib64/librpm.so.3", "elfType" : 3, "buildId" : "B574D8D77B3CDE423A523F462DE67F41FB5754AA" }, \{ "b" : "7F958A815000", "path" : "/lib64/librpmio.so.3", "elfType" : 3, "buildId" : "ABCB07D58923C8CE38C5D2CCCE842AA5F4B76E51" }, \{ "b" : "7F958A5A6000", "path" : "/lib64/libnetsnmpagent.so.31", "elfType" : 3, "buildId" : "7DE39C44A6D01E12708F5A657B385CE26FCC6C03" }, \{ "b" : "7F958A39B000", "path" : "/lib64/libwrap.so.0", "elfType" : 3, "buildId" : "8C4AA46577D3AA7EBF8338BDFAECC6586EF29906" }, \{ "b" : "7F958A098000", "path" : "/lib64/libnetsnmp.so.31", "elfType" : 3, "buildId" : "AC7DB8AC490BDCEFCB7711E3F33E4357E7A8CED1" }, \{ "b" : "7F9589E26000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "FCF9ABA6A186EA1AF5E235B747F0BA7D83ABAEE3" }, \{ "b" : "7F95899C3000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "3E0C36DE1A57CABFC7C9C319C2EDFC6F0DB8A54C" }, \{ "b" : "7F95897A6000", "path" : "/lib64/libsasl2.so.3", "elfType" : 3, "buildId" : "E2F2017F821DD1B9D307DA1A9B8014F2941AEB7B" }, \{ "b" : "7F9589559000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "0CAEC124D97114DA40DDEB0FED1FAD5D14C3D626" }, \{ "b" : "7F95892EF000", "path" : "/lib64/libcurl.so.4", "elfType" : 3, "buildId" : "AA185EEE9B8EE0BCE652F7A6E9AAD07B1B3808C4" }, \{ "b" : "7F9588FED000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "085D924F5D23B9F15A8AD28B7231EE93C09E13F1" }, \{ "b" : "7F9588DD3000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C3B2DD93CD59A17EA97148EC98C2667ADB9987A3" }, \{ "b" : "7F9588BCB000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "CCD4BE566DD5A8FC7FA62B224C14B698F51B0D0D" }, \{ "b" : "7F95889B5000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "EDF51350C7F71496149D064AA8B1441F786DF88A" }, \{ "b" : "7F9588799000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "2B482B3BAE79DEF4E5BC9791BC6BBDAE0E93E359" }, \{ "b" : "7F95883CB000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F9FAFDE281E0E0E2AF45911AD0FA115B64C2CEA8" }, \{ "b" : "7F958B9E4000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "F661F01D5570D658EF1EE970FE90D894B636F15A" }, \{ "b" : "7F958816E000", "path" : "/lib64/libssl3.so", "elfType" : 3, "buildId" : "AAEB3A8C75F24B6EF9E965F4C7B41F1D10E4A1E3" }, \{ "b" : "7F9587F46000", "path" : "/lib64/libsmime3.so", "elfType" : 3, "buildId" : "CED43363B6A38A426D2A5EEECC7A267DD7BFBD60" }, \{ "b" : "7F9587C12000", "path" : "/lib64/libnss3.so", "elfType" : 3, "buildId" : "8EBF98BF33F01E42E4388F6E256B56D1325A54EA" }, \{ "b" : "7F95879E2000", "path" : "/lib64/libnssutil3.so", "elfType" : 3, "buildId" : "A61604F9C4E3F975A0A1742174F08D6ECF987A63" }, \{ "b" : "7F95877DE000", "path" : "/lib64/libplds4.so", "elfType" : 3, "buildId" : "E761D8698407F6521F26F579D61D5EC8F7EF04A9" }, \{ "b" : "7F95875D9000", "path" : "/lib64/libplc4.so", "elfType" : 3, "buildId" : "41E234507D6BB1E4FE56A078127D36C1963460CC" }, \{ "b" : "7F958739B000", "path" : "/lib64/libnspr4.so", "elfType" : 3, "buildId" : "051956498509E67F79215B76781C1AA0504EB5D5" }, \{ "b" : "7F958700D000", "path" : "/usr/lib64/perl5/CORE/libperl.so", "elfType" : 3, "buildId" : "8333C99F8CCDF10698873D515DD7079FE250B89B" }, \{ "b" : "7F9586DF3000", "path" : "/lib64/libnsl.so.1", "elfType" : 3, "buildId" : "96FFFD935288296AF9A07B535C736E0E376BA55A" }, \{ "b" : "7F9586BBC000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "164A07A654E3B6AA09A43BF9ACE3728AB02BD0D7" }, \{ "b" : "7F95869B9000", "path" : "/lib64/libutil.so.1", "elfType" : 3, "buildId" : "1586CEFA927D26F144DE15389F28C1CBF04C81EF" }, \{ "b" : "7F95867A9000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "0C85C0386F0CF41EA39969CF7F58A558D1AD3235" }, \{ "b" : "7F9586593000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "B9D5F73428BD6AD68C96986B57BEA3B7CEDB9745" }, \{ "b" : "7F958637B000", "path" : "/lib64/libelf.so.1", "elfType" : 3, "buildId" : "EC1BEB167783D08FD5F52C6CC707C8F38E637AF2" }, \{ "b" : "7F9586155000", "path" : "/lib64/liblzma.so.5", "elfType" : 3, "buildId" : "3B2C97C1937B73A69C412A96D0810C43DF0C6F54" }, \{ "b" : "7F9585F4B000", "path" : "/lib64/libpopt.so.0", "elfType" : 3, "buildId" : "7AE00165FBAF6920DD5AED6905820DDBAB589E84" }, \{ "b" : "7F9585D24000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "903A0BD0BFB4FEE8C284F41BEB9773DED94CBC52" }, \{ "b" : "7F9585B1F000", "path" : "/lib64/libcap.so.2", "elfType" : 3, "buildId" : "3BC565E0565C33B1BD37AE0070F7D8E2CE4313E4" }, \{ "b" : "7F9585916000", "path" : "/lib64/libacl.so.1", "elfType" : 3, "buildId" : "5F8AD935C32D87C9A73FE96A36A6D9C68C90D452" }, \{ "b" : "7F95856E8000", "path" : "/lib64/liblua-5.1.so", "elfType" : 3, "buildId" : "BDD4B9CFC1D3F31D3A5A430D2F9080E020C5B0BA" }, \{ "b" : "7F9585329000", "path" : "/lib64/libdb-5.3.so", "elfType" : 3, "buildId" : "CA8916E2C5EB6FF8582E059700E3347178823728" }, \{ "b" : "7F9585100000", "path" : "/lib64/libaudit.so.1", "elfType" : 3, "buildId" : "2E36E1B9A2D92C969E38CDDCC729F55D8BACBB2B" }, \{ "b" : "7F9584E17000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "4EBF28968DA5784ED6606BFF7C1915C50AC24502" }, \{ "b" : "7F9584C13000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "2ADDB65846A50CE45F0C9B62EA35DDA62C6AD7A2" }, \{ "b" : "7F95849E0000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "5FF9D1075A8D5D62F77F5CE56C935FCD92C62EFA" }, \{ "b" : "7F95847D0000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "779381063DAECC27E8480C8F79F0651162586478" }, \{ "b" : "7F95845CC000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, \{ "b" : "7F9584399000", "path" : "/lib64/libidn.so.11", "elfType" : 3, "buildId" : "2B77BBEFFF65E94F3E0B71A4E89BEB68C4B476C5" }, \{ "b" : "7F958416C000", "path" : "/lib64/libssh2.so.1", "elfType" : 3, "buildId" : "CB0BD6C014F41EC926FAC41322C82FF4A5EB88B9" }, \{ "b" : "7F9583F69000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "020C788B41DCC71AEE66B822D7670BC4347DA006" }, \{ "b" : "7F9583D07000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "9CA3D11F018BEEB719CDB34BE800BF1641350D0A" }, \{ "b" : "7F9583B02000", "path" : "/lib64/libattr.so.1", "elfType" : 3, "buildId" : "2617ECC6738047E207AE3ADD990BD6A34D11B265" }, \{ "b" : "7F95838FC000", "path" : "/lib64/libcap-ng.so.0", "elfType" : 3, "buildId" : "43578677DF613E9D58128ED4AE0C344FBC1E44C0" }, \{ "b" : "7F95836F7000", "path" : "/usr/lib64/sasl2/libanonymous.so", "elfType" : 3, "buildId" : "471C3A1DE301A04992FA4046A17AA00EB9F6BC3D" }, \{ "b" : "7F95834F0000", "path" : "/usr/lib64/sasl2/libsasldb.so", "elfType" : 3, "buildId" : "A4CE135A2E85455B83669F18F33F81E2B6935288" }, \{ "b" : "7F95832E7000", "path" : "/usr/lib64/sasl2/libgssapiv2.so", "elfType" : 3, "buildId" : "50AA22CE9DCB3B3BF27ACFD9AB8779533401CE99" }, \{ "b" : "7F95830E2000", "path" : "/usr/lib64/sasl2/liblogin.so", "elfType" : 3, "buildId" : "195C9ABB91F2E86709E4A73FB11C57838AE5CFE2" }, \{ "b" : "7F9582EDD000", "path" : "/usr/lib64/sasl2/libplain.so", "elfType" : 3, "buildId" : "728E54F69EF3F1EF806223E14AB8E3789FADC6C0" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x559a85513121]
 mongod(+0x2A3474C) [0x559a8551274c]
 mongod(+0x2A347D6) [0x559a855127d6]
 libpthread.so.0(+0xF630) [0x7f95887a8630]
 libc.so.6(gsignal+0x37) [0x7f9588401387]
 libc.so.6(abort+0x148) [0x7f9588402a78]
 mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x0) [0x559a8384a9bb]
 mongod(+0xD43290) [0x559a83821290]
 mongod(_ZN5mongo10LockerImplD0Ev+0x11) [0x559a853d8ec1]
 mongod(_ZN5mongo16OperationContextD1Ev+0xA7) [0x559a8540cbe7]
 mongod(_ZN5mongo16OperationContextD0Ev+0x11) [0x559a8540ccc1]
 mongod(_ZNK5mongo14ServiceContext23OperationContextDeleterclEPNS_16OperationContextE+0xDC) [0x559a8540f88c]
 mongod(+0x14DF151) [0x559a83fbd151]
 mongod(+0xB60156) [0x559a8363e156]
 mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x113) [0x559a84bd2bd3]
 mongod(+0x20F50F2) [0x559a84bd30f2]
 mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE+0xFA) [0x559a842f7bda]
 mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0x91) [0x559a842f90a1]
 mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x99) [0x559a842fa1a9]
 mongod(+0x181CA72) [0x559a842faa72]
 mongod(+0x2B5C8BF) [0x559a8563a8bf]
 libpthread.so.0(+0x7EA5) [0x7f95887a0ea5]
 libc.so.6(clone+0x6D) [0x7f95884c996d]
----- END BACKTRACE -----



 Comments   
Comment by Kelsey Schubert [ 31/Mar/22 ]

After investigation, we have determined that this issue was resolved by SERVER-60685.

Comment by Kelsey Schubert [ 19/Aug/21 ]

connie.chen, I'm confused. If the bug is still present, shouldn't this ticket remain open? It's a lot easier to find an open ticket (currently it looks the bug has been resolved if you look at the jira metadata). I'm planning to reopen this ticket and park it on the backlog and move it back into needs scheduling when we have a recurrence with the new logging. Let me know if I'm missing something with my approach.

Comment by Connie Chen [ 09/Jul/21 ]

Logging has already been committed on master and will be backported to 4.2 and 4.4. Reopen this ticket if invariant happens again. 

Comment by Dianna Hohensee (Inactive) [ 02/Jun/21 ]

So the TransactionCoordinator only appears to take a mutex(es) in the implementation. The TransactionCoordinator operation fails, as Dan says above, in a DBDirectClient command to do a write. The logs have expired, but we've still got Dan's analysis.

It's unclear to me what could have gone wrong. The TransactionParticipant is the only component in the codebase using Locker instances outside of RAII types, but the TransactionCoordinator isn't involved with the TransactionParticipant at all as far as I can tell. We could add some logging information to the invariant failure. It isn't clear what information would be helpful, so basically whatever we can. ~LockerImpl checks that the lock requests are empty, which is a map of ResourceId to LockRequest -- we could log some of this information.

Comment by Daniel Gottlieb (Inactive) [ 29/Mar/21 ]

I'm not the expert in this, but I don't think the transaction code is breaking any API contracts that led to this invariant. I'm passing this off to execution to take a look. My observations:

  • The invariant of there being leftover "requests" in an OpCtx's/LockerImpl when the OpCtx is being destructed first suggests some code grabbed a lock without an RAII helper and forgot to unlock it (presumably in the case of some error).
    • I could not find any non-RAII lock usage in the TransactionCoordinator code paths.
    • Having gone through the lock manager code, we add a lock request to the map at the start of trying to get the lock. From what I could tell, we take care to remove the request in any critical section where we can bail (i.e: hitting a timeout/being interrupted). Execution will have a keener eye to tell if anything is suspicious.
  • Right before the TransactionCoordinator thread hit the invariant, it logged a slow op line:
    • Based on the query/update in the slow op log, this looks like the transaction coordinator was persisting the participant list:

      u: { _id: { lsid: { id: <uuid>), uid: BinData(...) }, txnNumber: 25 }, participants: [ <shard0>, <shard1> ] }, multi: false, upsert: true
      

    • The TransactionCoordinator looks like it hit its self-imposed (default 60s) timeout:

      errName:TransactionCoordinatorReachedAbortDecision errCode:282
      

    • The DBDirectClient command (cooperatively) observed the timeout:

      errMsg:"operation was interrupted"
      

    • The command got the PBWM and RSTL successfully, but spent the entire 60s timeout waiting on the global lock:

      locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 59040493 } } } 59403ms
      

Comment by Kaloian Manassiev [ 20/Mar/21 ]

This is likely in the 2PC transactions path, because of the presence of the AsyncWorkScheduler in the stacks. Moving to Sharding-NYC.

Generated at Thu Feb 08 05:36:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.