[SERVER-27796] Invariant failure in cluster_cursor_manager.cpp, numDeleted == 1 Created: 24/Jan/17  Updated: 06/Apr/21  Resolved: 05/Apr/21

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.2.0, 3.4.0, 3.6.7, 4.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Samantha Ritter (Inactive) Assignee: David Storch
Resolution: Duplicate Votes: 1
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-36926 Undefined behavior from signed overfl... Closed
Related
is related to SERVER-49888 splitChunk command fails periodically Closed
is related to SERVER-47869 Add diagnostic logging to ClusterCurs... Closed
is related to SERVER-49005 Improve concurrency testing for mongo... Backlog
Operating System: ALL
Sprint: Query 2017-03-06, Query 2017-07-10, Query 2020-04-06, Query 2020-04-20, Query 2020-05-04, Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-04-05, Query Execution 2021-04-19
Participants:
Case:

 Description   

A user experienced the following invariant crash:

2017-01-19T00:51:36.321+0800 I SHARDING [Balancer] distributed lock with ts: 587f9d18e078be11593c7930' unlocked.
2017-01-19T00:51:43.319+0800 I -        [conn23249] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 472
 
2017-01-19T00:51:43.319+0800 I -        [conn23249]
 
***aborting after invariant() failure
 
 
2017-01-19T00:51:43.323+0800 F -        [conn23249] Got signal: 6 (Aborted).
 
 0xc42e82 0xc41da9 0xc425b2 0x366160f710 0x3661232625 0x3661233e05 0xbd7b4b 0xb794f8 0xb79779 0xb79a11 0xb7bf42 0xb32506 0xb84cb4 0xb8597d 0xb9418d 0xb842f6 0x672ea5 0xbfd2cd 0x36616079d1 0x36612e89dd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"842E82","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"841DA9"},{"b":"400000","o":"8425B2"},{"b":"3661600000","o":"F710"},{"b":"3661200000","o":"32625","s":"gsignal"},{"b":"3661200000","o":"33E05","s":"abort"},{"b":"400000","o":"7D7B4B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"7794F8","s":"_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx"},{"b":"400000","o":"779779","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"400000","o":"779A11","s":"_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"400000","o":"77BF42","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"400000","o":"732506"},{"b":"400000","o":"784CB4","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE"},{"b":"400000","o":"78597D","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"79418D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"7842F6","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"}



 Comments   
Comment by David Storch [ 06/Apr/21 ]

A quick update: I just finished backporting SERVER-36926 to the 4.0 and 3.6 branches. I expect the fix to be released as part of versions 4.0.24 and 3.6.24.

Comment by David Storch [ 05/Apr/21 ]

A core dump provided by a customer who experienced this issue has finally allowed us to identify the root cause of this bug! In short, the problem was already reported in SERVER-36926. I am closing this ticket as a duplicate of SERVER-36926; I encourage watchers of this ticket to also watch SERVER-36926 for updates. We fixed SERVER-36926 in 4.1.3, and first released it as part of 4.2, which explains why this issue has not yet been observed in 4.2.x or 4.4.x versions. Given this root cause analysis, we will backport SERVER-36926 to the 4.0 and 3.6 branches. Read on for the technical details of the root cause analysis.

In the core dump, the cursor id associated with the getMore that caused mongos to crash is -9223372036236470444, or 0x8000000024da9754 when interpreting the bits as hex. This is interesting, because mongos treats the high-order 32 bits of the cursor id as special by assigning all cursors associated with a particular collection the same 32 bit prefix. The high order bits for our cursor id of interest are 0x80000000. This bit pattern is equivalent to INT_MIN when interpreted as a 2's complement 32 bit signed integer, which immediately points to SERVER-36926.
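
As a standalone illustration (this is not MongoDB source code and the names are made up), the following C++ snippet shows how that cursor id decomposes into the 32-bit prefix:

#include <cstdint>
#include <cstdio>

int main() {
    // Cursor id reported in the core dump, as a signed 64-bit value.
    const int64_t cursorId = -9223372036236470444LL;
    // Reinterpret the same bits as unsigned so they can be printed in hex.
    const uint64_t bits = static_cast<uint64_t>(cursorId);
    // mongos keeps the per-collection prefix in the high-order 32 bits.
    const uint32_t prefix = static_cast<uint32_t>(bits >> 32);
    std::printf("bits   = 0x%016llx\n", static_cast<unsigned long long>(bits));  // 0x8000000024da9754
    std::printf("prefix = 0x%08x\n", prefix);                                    // 0x80000000
    return 0;
}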

To elaborate, the 32 bit prefixes are assigned by generating a random int32 and taking its absolute value. SERVER-36926 observed that the call to std::abs() results in undefined behavior due to signed overflow if the int32 is INT_MIN. The fix for the bug introduced a special check for INT_MIN. In versions where the fix is not in place, it appears that what actually ends up happening is that INT_MIN is reinterpreted as the unsigned int 0x80000000, or 2147483648. In the core dump, 0x80000000 is the value for containerPrefix stored inside the relevant entry of the _namespaceToContainerMap.
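
The following minimal sketch (again, not the server code) shows the problematic pattern. Because the operation is undefined, the wrap-around noted in the comments is only what we observed on the affected builds, not guaranteed behavior:

#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    int32_t r = INT_MIN;  // the unlucky value returned by the PRNG
    // std::abs(INT_MIN) is undefined behavior: +2147483648 is not representable
    // as a 32-bit signed int. On the affected builds the value appears to come
    // back as INT_MIN, so the cast below yields the prefix 0x80000000.
    uint32_t containerPrefix = static_cast<uint32_t>(std::abs(r));
    std::printf("containerPrefix = 0x%08x\n", containerPrefix);  // observed: 0x80000000
    return 0;
}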

For reasons I don't fully comprehend, the core dump proves that this scenario results in corruption of the _cursorIdPrefixToNamespaceMap. I can see from the core dump that the (0x80000000, "customer_db.customer_namespace") pair should be stored in bucket index 2 according to the hash function. However, this data is actually stored in a different bucket. It appears that the undefined behavior caused the hash to be computed incorrectly at insert time, resulting in a state where the data for key 0x80000000 is stored in the wrong bucket! The corrupted hash table causes the invariant to trip downstream, since the invariant is checking that the hash table is in the expected state during deletion of a cursor.
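
To make the corruption concrete, here is a small, hypothetical consistency check (not part of the server code) that one could run over a map shaped like _cursorIdPrefixToNamespaceMap. On a healthy table it prints nothing; the corrupted state described above would be flagged, and erase() on the misplaced key would return 0:

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// Walk every bucket and report any entry that is not stored in the bucket its
// key currently hashes to.
void checkBuckets(const std::unordered_map<uint32_t, std::string>& m) {
    for (size_t b = 0; b < m.bucket_count(); ++b) {
        for (auto it = m.begin(b); it != m.end(b); ++it) {
            const size_t expected = m.bucket(it->first);
            if (expected != b) {
                std::printf("key 0x%08x found in bucket %zu, expected bucket %zu\n",
                            it->first, b, expected);
            }
        }
    }
}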

Thanks to ian.boros for aiding me in the diagnosis of this issue! Ian also helped create a repro to test our theory. In particular, we reproduced the issue by applying the following patch against the 4.0 branch and then running the cluster_cursor_manager_test unit test:

diff --git a/src/mongo/s/query/cluster_cursor_manager.cpp b/src/mongo/s/query/cluster_cursor_manager.cpp
index 5c267bde0a..529498b91a 100644
--- a/src/mongo/s/query/cluster_cursor_manager.cpp
+++ b/src/mongo/s/query/cluster_cursor_manager.cpp
@@ -250,7 +250,11 @@ StatusWith<CursorId> ClusterCursorManager::registerCursor(
             // The server has always generated positive values for CursorId (which is a signed
             // type), so we use std::abs() here on the prefix for consistency with this historical
             // behavior.
-            containerPrefix = static_cast<uint32_t>(std::abs(_pseudoRandom.nextInt32()));
+            auto rand = _pseudoRandom.nextInt32();
+            if (rand < 1000 * 1000 * 1000) {
+                rand = std::numeric_limits<int32_t>::min();
+            }
+            containerPrefix = static_cast<uint32_t>(std::abs(rand));
         } while (_cursorIdPrefixToNamespaceMap.count(containerPrefix) > 0);
         _cursorIdPrefixToNamespaceMap[containerPrefix] = nss;

Running this repro through gdb, I see a similar case of an inconsistent std::unordered_map where the data associated with key 0x80000000 appears in the wrong bucket.

Comment by Githook User [ 03/Feb/21 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@mongodb.com', 'username': 'puppyofkosh'}

Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager

There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.0
https://github.com/mongodb/mongo/commit/30ebef0be7a29c9f7f7630615419eaead7c071c5

Comment by Githook User [ 02/Feb/21 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@mongodb.com', 'username': 'puppyofkosh'}

Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager

There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.2
https://github.com/mongodb/mongo/commit/1c66cbdf54e8846022c47aa16e53e7c1750756e2

Comment by Githook User [ 01/Feb/21 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@mongodb.com', 'username': 'puppyofkosh'}

Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager

There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.4
https://github.com/mongodb/mongo/commit/e1233abcbe7e5950c0a200f10277629352922750

Comment by David Storch [ 14/May/20 ]

To elaborate on Ian's comment, we think we need to gather more diagnostic information at the time of this crash in order to understand the problem. Several engineers have inspected the code since this issue was first reported without spotting the error. We've seen it in the wild a handful of times, but never in our numerous concurrency stress tests that run in our CI loop. We hope to catch another instance of this after SERVER-47869 is merged and released in order to gather more debug info.

Comment by Ian Boros [ 14/May/20 ]

Returning this to the backlog now that SERVER-47869 is in progress.

Comment by David Storch [ 07/Dec/18 ]

Thanks filip.havlicek@livesport.eu. Do let us know if you have any more details about what can provoke this crash. I've marked this ticket as affecting the 3.6 branch. So far we've seen it on 3.2, 3.4, and 3.6 but never on 4.0. My suspicion is that 4.0 is affected as well, however.

Comment by Filip Havlíček [ 06/Dec/18 ]

Unfortunately I don't have more details. It was the first time it happened in our cluster. If it occurs again in the future I will try to find more details.

I just want to let you know that this issue was seen on version 3.6 too.

Comment by David Storch [ 06/Dec/18 ]

filip.havlicek@livesport.eu, do you have any details about how to reproduce this crash, or at least knowledge of the workload that mongos was processing at the time of the crash? We've seen this happen a handful of times, but it seems to be rare and hard to reproduce, which has made it difficult to reach a root cause diagnosis.

CC ian.boros

Comment by Filip Havlíček [ 06/Dec/18 ]

Probably the same/similar issue on 3.6.

Debian Stretch (kernel 4.4.112)

package mongodb-org-mongos (3.6.7) from the official repository

mongos version v3.6.7
 OpenSSL version: OpenSSL 1.1.0j 20 Nov 2018
 allocator: tcmalloc
 modules: none
 git version: 2628472127e9f1826e02c665c1d93880a204075e
 distmod: debian92
 build environment:
   target_arch: x86_64
   distarch: x86_64

  

December 6th 2018, 01:20:29.886 [conn10467713] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 624
December 6th 2018, 01:20:29.889 [conn10467713]
***aborting after invariant() failure
December 6th 2018, 01:20:29.913 [conn10467713] Got signal: 6 (Aborted).
0x55971cbc79e1 0x55971cbc6bf9 0x55971cbc70dd 0x7f6500ed70c0 0x7f6500b59fff 0x7f6500b5b42a 0x55971bff3cd8 0x55971c3a46e8 0x55971c3a4946 0x55971c3a4b8d 0x55971c1477a6 0x55971c0d12d6 0x55971c4fc836 0x55971c4f7adf 0x55971c123f38 0x55971c124f93 0x55971c125679 0x55971c046145 0x55971c0629ea 0x55971c05e3a7 0x55971c0617e1 0x55971c4b8bc2 0x55971c05d20f 0x55971c05f755 0x55971c06004b 0x55971c05e42d 0x55971c0617e1 0x55971c4b9125 0x55971ca75da4 0x7f6500ecd494 0x7f6500c0facf
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55971BAF0000","o":"10D79E1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55971BAF0000","o":"10D6BF9"},{"b":"55971BAF0000","o":"10D70DD"},{"b":"7F6500EC6000","o":"110C0"},{"b":"7F6500B27000","o":"32FFF","s":"gsignal"},{"b":"7F6500B27000","o":"3442A","s":"abort"},{"b":"55971BAF0000","o":"503CD8","s":"_ZN5mongo22invariantFailedWithMsgEPKcS1_S1_j"},{"b":"55971BAF0000","o":"8B46E8","s":"_ZN5mongo20ClusterCursorManager13_detachCursorENS_8WithLockERKNS_15NamespaceStringEx"},{"b":"55971BAF0000","o":"8B4946","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"55971BAF0000","o":"8B4B8D","s":"_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"55971BAF0000","o":"6577A6","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"55971BAF0000","o":"5E12D6"},{"b":"55971BAF0000","o":"A0C836","s":"_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55971BAF0000","o":"A07ADF","s":"_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55971BAF0000","o":"633F38"},{"b":"55971BAF0000","o":"634F93"},{"b":"55971BAF0000","o":"635679","s":"_ZN5mongo8Strategy13clientCommandEPNS_16OperationContextERKNS_7MessageE"},{"b":"55971BAF0000","o":"556145","s":"_ZN5mongo23ServiceEntryPointMongos13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"55971BAF0000","o":"5729EA","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"56E3A7","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"5717E1"},{"b":"55971BAF0000","o":"9C8BC2","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"55971BAF0000","o":"56D20F","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"55971BAF0000","o":"56F755","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"55971BAF0000","o":"57004B","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"56E42D","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"5717E1"},{"b":"55971BAF0000","o":"9C9125"},{"b":"55971BAF0000","o":"F85DA4"},{"b":"7F6500EC6000","o":"7494"},{"b":"7F6500B27000","o":"E8ACF","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.7", "gitVersion" : "2628472127e9f1826e02c665c1d93880a204075e", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.112-ls-3", "version" : "#3 SMP Mon Feb 12 14:56:37 CET 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55971BAF0000", "elfType" : 3, "buildId" : "1895B75D6C02803A30D7407F21FC01668D053F7D" }, { "b" : "7FFD87184000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "DEBF60FC8D64715B7432B531F79998366AB94138" }, { "b" : "7F6502109000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "713D47D5F599289C0A91ADE8F0122B2B4AA78B2E" }, { "b" : "7F6501C76000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1", "elfType" : 3, "buildId" : "2CFE882A331D7857E9CE1B5DE3255E6DA76EF899" }, { "b" : "7F6501A0A000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", "elfType" : 
3, "buildId" : "E2AA3B39763D943F56B3BD05C8E36E639BA95E12" }, { "b" : "7F6501806000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "B895F0831F623C5F23603401D4069F9F94C24761" }, { "b" : "7F65015FE000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "5D83E0642E645026DBB11F89F7DF7106BD821495" }, { "b" : "7F65012FA000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1B95E3A8B8788B07E4F59EE69B1877F9DEB42033" }, { "b" : "7F65010E3000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "51AD5FD294CD6C813BED40717347A53434B80B7A" }, { "b" : "7F6500EC6000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "4285CD3158DDE596765C747AE210AB6CBD258B22" }, { "b" : "7F6500B27000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA889E26A70F98FA8D230D088F7CC5BF43573163" }, { "b" : "7F6502320000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "263F909DBE11A66F7C6233E3FF0521148D9F8370" }, { "b" : "7F6500915000", "path" : "/lib/x86_64-linux-gnu/libnss_files.so.2", "elfType" : 3, "buildId" : "9D8B83EEDDF7CB86FF2022D501A98C310BD4C75B" }, { "b" : "7F650070F000", "path" : "/lib/x86_64-linux-gnu/libnss_dns.so.2", "elfType" : 3, "buildId" : "48C37C0EB2099FF3D698B9F6E006B7EB76994AD5" } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x41) [0x55971cbc79e1]
 mongos(+0x10D6BF9) [0x55971cbc6bf9]
 mongos(+0x10D70DD) [0x55971cbc70dd]
 libpthread.so.0(+0x110C0) [0x7f6500ed70c0]
 libc.so.6(gsignal+0xCF) [0x7f6500b59fff]
 libc.so.6(abort+0x16A) [0x7f6500b5b42a]
 mongos(_ZN5mongo22invariantFailedWithMsgEPKcS1_S1_j+0x0) [0x55971bff3cd8]
 mongos(_ZN5mongo20ClusterCursorManager13_detachCursorENS_8WithLockERKNS_15NamespaceStringEx+0x218) [0x55971c3a46e8]
 mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x136) [0x55971c3a4946]
 mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x4D) [0x55971c3a4b8d]
 mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x1C6) [0x55971c1477a6]
 mongos(+0x5E12D6) [0x55971c0d12d6]
 mongos(_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x76) [0x55971c4fc836]
 mongos(_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x1F) [0x55971c4f7adf]
 mongos(+0x633F38) [0x55971c123f38]
 mongos(+0x634F93) [0x55971c124f93]
 mongos(_ZN5mongo8Strategy13clientCommandEPNS_16OperationContextERKNS_7MessageE+0x59) [0x55971c125679]
 mongos(_ZN5mongo23ServiceEntryPointMongos13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x5B5) [0x55971c046145]
 mongos(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xBA) [0x55971c0629ea]
 mongos(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x97) [0x55971c05e3a7]
 mongos(+0x5717E1) [0x55971c0617e1]
 mongos(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x1A2) [0x55971c4b8bc2]
 mongos(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x15F) [0x55971c05d20f]
 mongos(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0xAF5) [0x55971c05f755]
 mongos(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x23B) [0x55971c06004b]
 mongos(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x11D) [0x55971c05e42d]
 mongos(+0x5717E1) [0x55971c0617e1]
 mongos(+0x9C9125) [0x55971c4b9125]
 mongos(+0xF85DA4) [0x55971ca75da4]
 libpthread.so.0(+0x7494) [0x7f6500ecd494]
 libc.so.6(clone+0x3F) [0x7f6500c0facf]
----- END BACKTRACE -----

 

Comment by Ian Boros [ 29/Aug/18 ]

I read through the code some more with David Storch, and he noticed that the call to std::abs(_pseudoRandom.nextInt32()) here could result in undefined behavior if _pseudoRandom.nextInt32() returns INT_MIN. We suspect that most compilers will do something "reasonable" in this situation, however, and are doubtful that it's the root cause of the bug (though I filed SERVER-36926). We also still have this code on the master branch, but have never seen the bug on versions other than 3.2.

Comment by Ian Boros [ 29/Aug/18 ]

For posterity, here's a symbolized version of the stack trace posted by Nuno Costa.

 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/stacktrace_posix.cpp:171:0: mongo::printStackTrace(std::ostream&)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/signal_handlers_synchronous.cpp:180:0: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/signal_handlers_synchronous.cpp:237:0: mongo::(anonymous namespace)::abruptQuit(int)
 ??:0:0: ??
 ??:0:0: ??
 ??:0:0: ??
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/assert_util.cpp:154:0: mongo::invariantFailed(char const*, char const*, unsigned int)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:472:0: mongo::ClusterCursorManager::detachCursor_inlock(mongo::NamespaceString const&, long long)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:310:0: mongo::ClusterCursorManager::checkInCursor(std::unique_ptr<mongo::ClusterClientCursor, std::default_delete<mongo::ClusterClientCursor> >, mongo::NamespaceString const&, long long, mongo::ClusterCursorManager::CursorState)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:131:0: mongo::ClusterCursorManager::PinnedCursor::returnCursor(mongo::ClusterCursorManager::CursorState)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_find.cpp:474:0: mongo::ClusterFind::runGetMore(mongo::OperationContext*, mongo::GetMoreRequest const&)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/commands/cluster_getmore_cmd.cpp:108:0: mongo::(anonymous namespace)::ClusterGetMoreCmd::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/s_only.cpp:128:0: mongo::Command::execCommandClientBasic(mongo::OperationContext*, mongo::Command*, mongo::ClientBasic&, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/s_only.cpp:171:0: mongo::Command::runAgainstRegistered(mongo::OperationContext*, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, int)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/strategy.cpp:237:0: mongo::Strategy::clientCommandOp(mongo::OperationContext*, mongo::Request&)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/request.cpp:109:0: mongo::Request::process(mongo::OperationContext*, int)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/server.cpp:200:0: mongo::ShardedMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)
 /data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/net/message_server_port.cpp:227:0: mongo::PortMessageServer::handleIncomingMsg(void*)
 ??:0:0: ??
 ??:0:0: ??

A few more observations:
1) This invariant failure has only happened while we were returning a cursor to the cursor manager via ClusterFind::runGetMore. The other code path which may detach/destroy cursors is from the cursor reaper. It's strange that we've never seen this invariant trigger from that path.

2) At the time we tried to use invalidated_cursors.js, there was a bug (see SERVER-33603) which was preventing it from running getMores on the correct collection. I re-ran the test with the bug fixed but still wasn't able to reproduce the problem.

3) I also made some attempts at running the test with
a) Fewer possible cursor IDs (in case there is some kind of cursor collision happening)
b) The ClusterCursorCleanupJob running more frequently

We can also try to reason about what must have happened based on the invariant failure, though this approach hasn't gotten me very far.
-The invariant is that numDeleted is 1.
-numDeleted is the return value from a call to std::map::erase, which is the number of elements removed; for a std::map that is either 0 or 1 (see the small example after this list). So numDeleted was 0.
-The way I see it, there are two options:
(a) _cursorIdPrefixToNamespaceMap is missing elements (it is out of sync with _namespaceToContainerMap)
(b) the cursorPrefix which we attempted to erase here is bogus.

(a) Does not seem plausible. _namespaceToContainerMap and _cursorIdPrefixToNamespaceMap each have only one place where an element is added (registerCursor) and one place where an element is deleted (detachCursor_inlock). In both places each map is updated.

I have to spend some more time looking into (b).
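
For reference, here is a tiny standalone example of the erase() behavior this reasoning relies on (the key and value are made up):

#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<unsigned int, std::string> prefixToNs{{0x12345678u, "db.coll"}};
    // erase(key) returns the number of elements removed: 1 if the key was
    // present, 0 if it was not. The invariant expects exactly 1, so a failure
    // means the prefix being erased was not in the map at that point.
    const size_t numDeleted = prefixToNs.erase(0xDEADBEEFu);  // hypothetical missing prefix
    std::printf("numDeleted = %zu\n", numDeleted);            // prints 0
    return 0;
}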

I think my remaining steps will be to
-Look into option (b)
-Look into whether auth failures could cause this problem (many of the logs posted by Nuno Costa had a lot of auth failures shortly before the invariant triggered).

Comment by Nakai Kanako [ 17/Nov/17 ]

We got the same issue on a mongos.
If you want additional info, please let me know.
Version: mongodb-linux-x86_64-3.2.4
OS: CentOS Linux release 7.2.1511 (Core)

2017-09-06T00:14:43.579+0900 I -        [conn225] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 472
2017-09-06T00:14:43.579+0900 I -        [conn225] 
 
***aborting after invariant() failure
 
 
2017-09-06T00:14:43.593+0900 F -        [conn225] Got signal: 6 (Aborted).
 
 0xc12fc2 0xc11ee9 0xc126f2 0x7f1488ebe100 0x7f1488b235f7 0x7f1488b24ce8 0xba7c7b 0xb48ee8 0xb49169 0xb49401 0xb4b932 0xb01bd6 0xb545b4 0xb5527d 0xb63c3d 0xb53bd6 0x646115 0xbcd3fd 0x7f1488eb6dc5 0x7f1488be428d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"812FC2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"811EE9"},{"b":"400000","o":"8126F2"},{"b":"7F1488EAF000","o":"F100"},{"b":"7F1488AEE000","o":"355F7","s":"gsignal"},{"b":"7F1488AEE000","o":"36CE8","s":"abort"},{"b":"400000","o":"7A7C7B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"748EE8","s":"_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx"},{"b":"400000","o":"749169","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"400000","o":"749401","s":"_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"400000","o":"74B932","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"400000","o":"701BD6"},{"b":"400000","o":"7545B4","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE"},{"b":"400000","o":"75527D","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"763C3D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"753BD6","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"},{"b":"400000","o":"246115","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"7CD3FD","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F1488EAF000","o":"7DC5"},{"b":"7F1488AEE000","o":"F628D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.4", "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-327.36.3.el7.x86_64", "version" : "#1 SMP Mon Oct 24 16:09:20 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFC2F889000", "elfType" : 3 }, { "b" : "7F1489AEF000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F14898EB000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F14895E3000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F14892E1000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F14890CB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F1488EAF000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F1488AEE000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F1489CF7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xc12fc2]
 mongos(+0x811EE9) [0xc11ee9]
 mongos(+0x8126F2) [0xc126f2]
 libpthread.so.0(+0xF100) [0x7f1488ebe100]
 libc.so.6(gsignal+0x37) [0x7f1488b235f7]
 libc.so.6(abort+0x148) [0x7f1488b24ce8]
 mongos(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0xba7c7b]
 mongos(_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx+0x1E8) [0xb48ee8]
 mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x129) [0xb49169]
 mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x41) [0xb49401]
 mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x262) [0xb4b932]
 mongos(+0x701BD6) [0xb01bd6]
 mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x6E4) [0xb545b4]
 mongos(_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x2ED) [0xb5527d]
 mongos(_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE+0x19D) [0xb63c3d]
 mongos(_ZN5mongo7Request7processEPNS_16OperationContextEi+0x876) [0xb53bd6]
 mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x65) [0x646115]
 mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x26D) [0xbcd3fd]
 libpthread.so.0(+0x7DC5) [0x7f1488eb6dc5]
 libc.so.6(clone+0x6D) [0x7f1488be428d]
-----  END BACKTRACE  -----

Comment by Ian Whalen (Inactive) [ 22/Aug/17 ]

Resolving this as Cannot Reproduce since we currently don't have enough information to proceed. Please do reopen if you have any other information you can provide.

Comment by Justin Seyster [ 12/Jul/17 ]

I adapted the invalidated_cursors.js test that david.storch mentioned so that it runs continuously in a sharded configuration, but after running it for ~48 hours, it did not produce a failure.

I'm going to return this to the backlog for now, so that we can come back to it when we have more information. That will probably mean waiting for it to reproduce in another context.

Steffen, thanks again for the detailed information and logs. When this reproduces, our team is going to be looking to see if the workload includes non-query commands that return cursors (like aggregate or listCollections) and also if there is the possibility that a collection got dropped around the time of the invariant failure. If you know whether those apply to your workload, we'd also be interested in knowing that.

Additionally, any information that you have about what commands or queries were running on mongos at the time of the failure and any general workload information that you have would be a huge help. Thanks!

Comment by David Storch [ 08/Jun/17 ]

Thanks for uploading the logs, steffen. The Query Team has analyzed them, but we were unable to find any useful information there. We have also done a code inspection of the ClusterCursorManager looking for a coding error that is consistent with this symptom. That has been a dead end so far.

We noticed that we have an existing test which attempts to stress concurrent cursor management: invalidated_cursors.js. However, this test has been blacklisted in a sharded configuration for an unrelated reason. The next action on this investigation should be to adapt the test to run sharded, and see if after many runs in our continuous integration loop we can reproduce this behavior.

Comment by Steffen [ 27/Apr/17 ]

Thanks. FYI: So far we don't know how to reproduce this. Also, it's the first time we noticed it.

Comment by Kelsey Schubert [ 27/Apr/17 ]

Hi Steffen,

I've created a secure upload portal for the logs. Files uploaded to this portal are visible only to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,
Thomas

Comment by Steffen [ 27/Apr/17 ]
  1. 27 Shards and 138 mongos
  2. I can upload it via private upload
  3. We don't know
  4. Both, but more reads

Comment by Tess Avitabile (Inactive) [ 27/Apr/17 ]

Hello Steffen,

Thank you for reporting this. To help us investigate this issue, can you provide us with the following information?

  • What is the topology of the cluster? How many mongoses and shards?
  • Can you provide us with the logs on the mongos leading up to the crash? (We can provide a secure upload site if you would not like to attach the logs to this ticket.)
  • This crash occurred while running a getMore. Do you happen to know what namespace (database and collection) this getMore was on? Was this namespace sharded?
  • What is the workload like on this cluster? Mostly reads or writes?

Thank you,
The Query Team

Comment by Steffen [ 27/Apr/17 ]

We seem to have just hit the same issue on a mongos:
MongoDB 3.2.10
Linux 4.4.0-71-generic #92~14.04.1-Ubuntu SMP Fri Mar 24 15:22:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] db version v3.2.10
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] allocator: tcmalloc
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] modules: none
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread] build environment:
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread]     distmod: ubuntu1404
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread]     distarch: x86_64
2017-04-27T06:20:02.388+0000 I CONTROL  [signalProcessingThread]     target_arch: x86_64

2017-04-27T08:45:30.824+0000 I -        [conn3676] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 472
2017-04-27T08:45:30.824+0000 I -        [conn3676]
 
***aborting after invariant() failure
 
 
2017-04-27T08:45:30.858+0000 F -        [conn3676] Got signal: 6 (Aborted).
 
 0xc68172 0xc672c9 0xc67ad2 0x7f9bf18c1330 0x7f9bf151fc37 0x7f9bf1523028 0xbec7bb 0xb92118 0xb92399 0xb92631 0xb94b12 0xb4d4d6 0xb9d5f4 0xb9e2bd 0xbac89d 0xb9ca32 0x6b3ad5 0xc13db5 0x7f9bf18b9184 0x7f9bf15e6bed
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"868172","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"8672C9"},{"b":"400000","o":"867AD2"},{"b":"7F9BF18B1000","o":"10330"},{"b":"7F9BF14E9000","o":"36C37","s":"gsignal"},{"b":"7F9BF14E9000","o":"3A028","s":"abort"},{"b":"400000","o":"7EC7BB","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000",
"o":"792118","s":"_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx"},{"b":"400000","o":"792399","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"400000","o":"792631","s":"_ZN5mongo20ClusterCursorM
anager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"400000","o":"794B12","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"400000","o":"74D4D6"},{"b":"400000","o":"79D5F4","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BS
ONObjBuilderE"},{"b":"400000","o":"79E2BD","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"7AC89D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"79CA32","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"}
,{"b":"400000","o":"2B3AD5","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"813DB5","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F9BF18B1000","o":"8184"},{"b":"7F9BF14E9000","o":"FDBED","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.10", "gitVersion" : "7
9d9b3ab5ce20f51c272b4411202710a082d0317", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-71-generic", "version" : "#92~14.04.1-Ubuntu SMP Fri Mar 24 15:22:50 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "63B4595F051D57181B7C28225EA834151DD88C94" }, { "b" : "7FFDDD5E5
000", "elfType" : 3, "buildId" : "B1EAD09775F74DEB311D03DBEE7B4C3DFA8939B1" }, { "b" : "7F9BF27D3000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "48A664AE6B0B4918A3EB0156C6364C4F084232FD" }, { "b" : "7F9BF23F7000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "6B8997EA892A
7FF37AC8CAA8F239D595251889BB" }, { "b" : "7F9BF21EF000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "1EEBA762A6A2C8884D56033EE8CCE79B95CD974D" }, { "b" : "7F9BF1FEB000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F881E59FF88BE4F29A228C8657376B3C325C2C" }, { "b" : "7F9BF1CE5000", "p
ath" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1654CB13B1D24ED03F4BDCB51FC7524B9181A771" }, { "b" : "7F9BF1ACF000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F9BF18B1000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "22F9078CFA529CCE1A814A4A1A1C018F169D5652" }, { "b" : "7F9BF14E9000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CA5C6CFE528AF541C3C2C15CEE4B3C74DA4E2FB4" }, { "b" : "7F9BF2A32000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "237E22E5AAC2DDFCD06518F63FD720FE758E6E5B" } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xc68172]
 mongos(+0x8672C9) [0xc672c9]
 mongos(+0x867AD2) [0xc67ad2]
 libpthread.so.0(+0x10330) [0x7f9bf18c1330]
 libc.so.6(gsignal+0x37) [0x7f9bf151fc37]
 libc.so.6(abort+0x148) [0x7f9bf1523028]
 mongos(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0xbec7bb]
 mongos(_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx+0x1E8) [0xb92118]
 mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x129) [0xb92399]
 mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x41) [0xb92631]
 mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x262) [0xb94b12]
 mongos(+0x74D4D6) [0xb4d4d6]
 mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x6E4) [0xb9d5f4]
 mongos(_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x2ED) [0xb9e2bd]
 mongos(_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE+0x19D) [0xbac89d]
 mongos(_ZN5mongo7Request7processEPNS_16OperationContextEi+0x672) [0xb9ca32]
 mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x65) [0x6b3ad5]
 mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0xc13db5]
 libpthread.so.0(+0x8184) [0x7f9bf18b9184]
 libc.so.6(clone+0x6D) [0x7f9bf15e6bed]
-----  END BACKTRACE  -----

Comment by Tess Avitabile (Inactive) [ 27/Feb/17 ]

The original reporter is no longer responsive, so we cannot obtain logs or further information.

Comment by Tess Avitabile (Inactive) [ 24/Feb/17 ]

The query team discussed the following possibilities in morning meeting and eliminated them as causes:

  • Failure to assign the result of a stdx::lock_guard<stdx::mutex> constructor to a named variable (which would release the lock immediately).
  • Logical failure in keeping _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap in sync.

I also looked into and eliminated the following possibilities:

  • Exception causing _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap to be out of sync:
    In registerCursor(), we insert into _cursorIdPrefixToNamespaceMap before inserting into _namespaceToContainerMap. Thus we cannot have a namespace in _namespaceToContainerMap whose containerPrefix is not in _cursorIdPrefixToNamespaceMap. Then when we erase from _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap in detachCursor_inlock, the second erase() is exception-safe. (A simplified sketch of this two-map bookkeeping follows below.)
  • Using a moved-from CursorEntryContainer:
    The only place where we move a CursorEntryContainer is here. Thus we cannot have any moved-from CursorEntryContainers in _namespaceToContainerMap.
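
For readers less familiar with this code, here is a simplified and purely illustrative sketch of the two-map bookkeeping discussed above; the member names mirror the real ones, but the types and logic are reduced to the minimum needed to show the insert/erase ordering:

#include <cstdint>
#include <map>
#include <string>

struct MiniClusterCursorManager {
    std::map<uint32_t, std::string> cursorIdPrefixToNamespaceMap;
    std::map<std::string, uint32_t> namespaceToContainerMap;  // stands in for the real CursorEntryContainer map

    void registerCursor(const std::string& nss, uint32_t containerPrefix) {
        // The prefix map is populated first; if this insert throws, the
        // namespace map is untouched and the two maps remain consistent.
        cursorIdPrefixToNamespaceMap[containerPrefix] = nss;
        namespaceToContainerMap[nss] = containerPrefix;
    }

    void detachCursor(const std::string& nss, uint32_t containerPrefix) {
        const size_t numDeleted = cursorIdPrefixToNamespaceMap.erase(containerPrefix);
        // The real code invariants that numDeleted == 1 here.
        (void)numDeleted;
        // Erasing by key from a std::map does not throw (for key types whose
        // comparison does not throw), which is the exception-safety point above.
        namespaceToContainerMap.erase(nss);
    }
};
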
Comment by Charlie Swanson [ 24/Jan/17 ]

I'm moving this to the query team's backlog, since it's my understanding that we own the cluster cursor manager.
