|
A quick update: I just finished backporting SERVER-36926 to the 4.0 and 3.6 branches. I expect the fix to be released as part of versions 4.0.24 and 3.6.24.
|
|
A core dump provided by a customer who experienced this issue has finally allowed us to identify the root cause of this bug! In short, the problem was already reported in SERVER-36926. I am closing this ticket as a duplicate of SERVER-36926; I encourage watchers of this ticket to also watch SERVER-36926 for updates. We fixed SERVER-36926 in 4.1.3, and first released it as part of 4.2, which explains why this issue has not yet been observed in 4.2.x or 4.4.x versions. Given this root cause analysis, we will backport SERVER-36926 to the 4.0 and 3.6 branches. Read on for the technical details of the root cause analysis.
In the core dump, the cursor id associated with the getMore that caused mongos to crash is -9223372036236470444, or 0x8000000024da9754 when the bits are written as hex. This is interesting, because mongos treats the high-order 32 bits of the cursor id as special: it assigns all cursors associated with a particular collection the same 32-bit prefix. The high-order bits of our cursor id of interest are 0x80000000. This bit pattern is equivalent to INT_MIN when interpreted as a two's complement 32-bit signed integer, which immediately points to SERVER-36926.
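As a sanity check on the arithmetic, here is a minimal standalone sketch (not server code) that extracts the 32-bit prefix from that cursor id:
#include <cstdint>
#include <cstdio>

int main() {
    // The cursor id from the core dump, as a signed 64-bit integer.
    const int64_t cursorId = -9223372036236470444LL;

    // mongos stores the per-collection prefix in the high-order 32 bits of the
    // cursor id, so shifting the unsigned representation right by 32 recovers it.
    const auto prefix = static_cast<uint32_t>(static_cast<uint64_t>(cursorId) >> 32);

    std::printf("cursor id bits = 0x%016llx, prefix = 0x%08x\n",
                static_cast<unsigned long long>(static_cast<uint64_t>(cursorId)),
                static_cast<unsigned>(prefix));
    // Prints: cursor id bits = 0x8000000024da9754, prefix = 0x80000000
    return 0;
}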
To elaborate, the 32-bit prefixes are assigned by generating a random int32 and taking its absolute value. SERVER-36926 observed that the call to std::abs() results in undefined behavior due to signed overflow when the int32 is INT_MIN. The fix for that bug introduced a special check for INT_MIN. In versions where the fix is not in place, what actually appears to happen is that INT_MIN is reinterpreted as the unsigned int 0x80000000, or 2147483648. In the core dump, 0x80000000 is the value of containerPrefix stored inside the relevant entry of the _namespaceToContainerMap.
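To make the undefined behavior concrete, here is a small standalone sketch (not the server's code, and the real fix's handling of INT_MIN may differ) showing the bit pattern that results when std::abs() effectively leaves INT_MIN unchanged:
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    // std::abs(x) is undefined behavior when x == INT_MIN, because +2147483648
    // is not representable in a 32-bit signed int. In practice many compilers
    // simply return INT_MIN again in this case.
    const int32_t intMin = std::numeric_limits<int32_t>::min();

    // This cast is well defined (modular conversion) and shows the prefix that
    // results if std::abs() effectively returned INT_MIN unchanged.
    const auto containerPrefix = static_cast<uint32_t>(intMin);

    std::printf("containerPrefix = 0x%08x (%u)\n",
                static_cast<unsigned>(containerPrefix),
                static_cast<unsigned>(containerPrefix));
    // Prints: containerPrefix = 0x80000000 (2147483648)
    return 0;
}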
For reasons I don't fully comprehend, the core dump proves that this scenario results in corruption of the _cursorIdPrefixToNamespaceMap. I can see from the core dump that the (0x80000000, "customer_db.customer_namespace") pair should be stored in bucket index 2 according to the hash function. However, this data is actually stored in a different bucket. It appears that the undefined behavior caused the hash to be computed incorrectly at insert time, resulting in a state where the data for key 0x80000000 is stored in the wrong bucket! The corrupted hash table causes the invariant to trip downstream, since the invariant is checking that the hash table is in the expected state during deletion of a cursor.
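To illustrate the kind of inconsistency observed in the core dump, here is a standalone sketch (hypothetical map and key names, not the server's actual types) of how to compare the bucket a key should hash to against the bucket its element actually occupies. In a healthy std::unordered_map the two always match; a mismatch is exactly the state in which find()/erase() on the key silently fail:
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
    // Hypothetical stand-in for _cursorIdPrefixToNamespaceMap.
    std::unordered_map<uint32_t, std::string> prefixToNamespace;
    prefixToNamespace[0x80000000u] = "customer_db.customer_namespace";

    const uint32_t key = 0x80000000u;

    // Bucket the key *should* be in, according to the map's hash function.
    const size_t expectedBucket = prefixToNamespace.bucket(key);

    // Bucket the element is *actually* stored in, found by scanning all buckets.
    size_t actualBucket = prefixToNamespace.bucket_count();  // sentinel
    for (size_t b = 0; b < prefixToNamespace.bucket_count(); ++b) {
        for (auto it = prefixToNamespace.begin(b); it != prefixToNamespace.end(b); ++it) {
            if (it->first == key) {
                actualBucket = b;
            }
        }
    }

    std::printf("expected bucket: %zu, actual bucket: %zu\n", expectedBucket, actualBucket);
    // In a healthy map these agree. In the corrupted core dump they did not,
    // so erase(key) removed nothing and the downstream invariant fired.
    return 0;
}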
Thanks to ian.boros for aiding me in the diagnosis of this issue! Ian also helped create a repro to test our theory. In particular, we reproduced the issue by applying the following patch against the 4.0 branch and then running the cluster_cursor_manager_test unit test:
diff --git a/src/mongo/s/query/cluster_cursor_manager.cpp b/src/mongo/s/query/cluster_cursor_manager.cpp
index 5c267bde0a..529498b91a 100644
--- a/src/mongo/s/query/cluster_cursor_manager.cpp
+++ b/src/mongo/s/query/cluster_cursor_manager.cpp
@@ -250,7 +250,11 @@ StatusWith<CursorId> ClusterCursorManager::registerCursor(
         // The server has always generated positive values for CursorId (which is a signed
         // type), so we use std::abs() here on the prefix for consistency with this historical
         // behavior.
-        containerPrefix = static_cast<uint32_t>(std::abs(_pseudoRandom.nextInt32()));
+        auto rand = _pseudoRandom.nextInt32();
+        if (rand < 1000 * 1000 * 1000) {
+            rand = std::numeric_limits<int32_t>::min();
+        }
+        containerPrefix = static_cast<uint32_t>(std::abs(rand));
     } while (_cursorIdPrefixToNamespaceMap.count(containerPrefix) > 0);
     _cursorIdPrefixToNamespaceMap[containerPrefix] = nss;
|
Running this repro through gdb, I see a similar case of an inconsistent std::unordered_map where the data associated with key 0x80000000 appears in the wrong bucket.
|
|
Author: Ian Boros <ian.boros@mongodb.com> (puppyofkosh)
Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager
There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.0
https://github.com/mongodb/mongo/commit/30ebef0be7a29c9f7f7630615419eaead7c071c5
|
|
Author: Ian Boros <ian.boros@mongodb.com> (puppyofkosh)
Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager
There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.2
https://github.com/mongodb/mongo/commit/1c66cbdf54e8846022c47aa16e53e7c1750756e2
|
|
Author: Ian Boros <ian.boros@mongodb.com> (puppyofkosh)
Message: SERVER-47869 Add debug logging mechanism to mongos cursor manager
There is a bug in the mongos cursor manager tracked by SERVER-27796 which has never been reproduced
in our testing environment but has been encountered by multiple customers. This change adds a small
diagnostic log to the mongos cursor manager which is dumped when a symptom of the bug is found.
Branch: v4.4
https://github.com/mongodb/mongo/commit/e1233abcbe7e5950c0a200f10277629352922750
|
|
To elaborate on Ian's comment, we think we need to gather more diagnostic information at the time of this crash in order to understand the problem. Several engineers have inspected the code since this issue was first reported without spotting the error. We've seen it in the wild a handful of times, but never in our numerous concurrency stress tests that run in our CI loop. We hope to catch another instance of this after SERVER-47869 is merged and released in order to gather more debug info.
|
|
Returning this to the backlog now that SERVER-47869 is in progress.
|
|
Thanks filip.havlicek@livesport.eu. Do let us know if you have any more details about what can provoke this crash. I've marked this ticket as affecting the 3.6 branch. So far we've seen it on 3.2, 3.4, and 3.6 but never on 4.0. My suspicion is that 4.0 is affected as well, however.
|
|
Unfortunately, I don't have more details. It was the first time it happened in our cluster. If it occurs again in the future, I will try to find more details.
I just want to let you know that this issue was seen on version 3.6 too.
|
|
filip.havlicek@livesport.eu, do you have any details about how to reproduce this crash, or at least knowledge of the workload that mongos was processing at the time of the crash? We've seen this happen a handful of times, but it seems to be rare and hard to reproduce, which has made it difficult to reach a root cause diagnosis.
CC ian.boros
|
|
Probably the same/similar issue on 3.6.
Debian Stretch (kernel 4.4.112)
package mongodb-org-mongos (3.6.7) from the official repository
mongos version v3.6.7
|
OpenSSL version: OpenSSL 1.1.0j 20 Nov 2018
|
allocator: tcmalloc
|
modules: none
|
git version: 2628472127e9f1826e02c665c1d93880a204075e
|
distmod: debian92
|
build environment:
|
target_arch: x86_64
|
distarch: x86_64
|
December 6th 2018, 01:20:29.886 [conn10467713] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 624
|
December 6th 2018, 01:20:29.889 [conn10467713]
|
***aborting after invariant() failure
|
December 6th 2018, 01:20:29.913 [conn10467713] Got signal: 6 (Aborted).
|
0x55971cbc79e1 0x55971cbc6bf9 0x55971cbc70dd 0x7f6500ed70c0 0x7f6500b59fff 0x7f6500b5b42a 0x55971bff3cd8 0x55971c3a46e8 0x55971c3a4946 0x55971c3a4b8d 0x55971c1477a6 0x55971c0d12d6 0x55971c4fc836 0x55971c4f7adf 0x55971c123f38 0x55971c124f93 0x55971c125679 0x55971c046145 0x55971c0629ea 0x55971c05e3a7 0x55971c0617e1 0x55971c4b8bc2 0x55971c05d20f 0x55971c05f755 0x55971c06004b 0x55971c05e42d 0x55971c0617e1 0x55971c4b9125 0x55971ca75da4 0x7f6500ecd494 0x7f6500c0facf
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"55971BAF0000","o":"10D79E1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55971BAF0000","o":"10D6BF9"},{"b":"55971BAF0000","o":"10D70DD"},{"b":"7F6500EC6000","o":"110C0"},{"b":"7F6500B27000","o":"32FFF","s":"gsignal"},{"b":"7F6500B27000","o":"3442A","s":"abort"},{"b":"55971BAF0000","o":"503CD8","s":"_ZN5mongo22invariantFailedWithMsgEPKcS1_S1_j"},{"b":"55971BAF0000","o":"8B46E8","s":"_ZN5mongo20ClusterCursorManager13_detachCursorENS_8WithLockERKNS_15NamespaceStringEx"},{"b":"55971BAF0000","o":"8B4946","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"55971BAF0000","o":"8B4B8D","s":"_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"55971BAF0000","o":"6577A6","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"55971BAF0000","o":"5E12D6"},{"b":"55971BAF0000","o":"A0C836","s":"_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55971BAF0000","o":"A07ADF","s":"_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55971BAF0000","o":"633F38"},{"b":"55971BAF0000","o":"634F93"},{"b":"55971BAF0000","o":"635679","s":"_ZN5mongo8Strategy13clientCommandEPNS_16OperationContextERKNS_7MessageE"},{"b":"55971BAF0000","o":"556145","s":"_ZN5mongo23ServiceEntryPointMongos13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"55971BAF0000","o":"5729EA","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"56E3A7","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"5717E1"},{"b":"55971BAF0000","o":"9C8BC2","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"55971BAF0000","o":"56D20F","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"55971BAF0000","o":"56F755","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"55971BAF0000","o":"57004B","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"56E42D","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55971BAF0000","o":"5717E1"},{"b":"55971BAF0000","o":"9C9125"},{"b":"55971BAF0000","o":"F85DA4"},{"b":"7F6500EC6000","o":"7494"},{"b":"7F6500B27000","o":"E8ACF","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.7", "gitVersion" : "2628472127e9f1826e02c665c1d93880a204075e", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.112-ls-3", "version" : "#3 SMP Mon Feb 12 14:56:37 CET 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55971BAF0000", "elfType" : 3, "buildId" : "1895B75D6C02803A30D7407F21FC01668D053F7D" }, { "b" : "7FFD87184000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "DEBF60FC8D64715B7432B531F79998366AB94138" }, { "b" : "7F6502109000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "713D47D5F599289C0A91ADE8F0122B2B4AA78B2E" }, { "b" : "7F6501C76000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1", "elfType" : 3, "buildId" : "2CFE882A331D7857E9CE1B5DE3255E6DA76EF899" }, { "b" : "7F6501A0A000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", "elfType" : 
3, "buildId" : "E2AA3B39763D943F56B3BD05C8E36E639BA95E12" }, { "b" : "7F6501806000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "B895F0831F623C5F23603401D4069F9F94C24761" }, { "b" : "7F65015FE000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "5D83E0642E645026DBB11F89F7DF7106BD821495" }, { "b" : "7F65012FA000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1B95E3A8B8788B07E4F59EE69B1877F9DEB42033" }, { "b" : "7F65010E3000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "51AD5FD294CD6C813BED40717347A53434B80B7A" }, { "b" : "7F6500EC6000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "4285CD3158DDE596765C747AE210AB6CBD258B22" }, { "b" : "7F6500B27000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA889E26A70F98FA8D230D088F7CC5BF43573163" }, { "b" : "7F6502320000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "263F909DBE11A66F7C6233E3FF0521148D9F8370" }, { "b" : "7F6500915000", "path" : "/lib/x86_64-linux-gnu/libnss_files.so.2", "elfType" : 3, "buildId" : "9D8B83EEDDF7CB86FF2022D501A98C310BD4C75B" }, { "b" : "7F650070F000", "path" : "/lib/x86_64-linux-gnu/libnss_dns.so.2", "elfType" : 3, "buildId" : "48C37C0EB2099FF3D698B9F6E006B7EB76994AD5" } ] }}
|
mongos(_ZN5mongo15printStackTraceERSo+0x41) [0x55971cbc79e1]
|
mongos(+0x10D6BF9) [0x55971cbc6bf9]
|
mongos(+0x10D70DD) [0x55971cbc70dd]
|
libpthread.so.0(+0x110C0) [0x7f6500ed70c0]
|
libc.so.6(gsignal+0xCF) [0x7f6500b59fff]
|
libc.so.6(abort+0x16A) [0x7f6500b5b42a]
|
mongos(_ZN5mongo22invariantFailedWithMsgEPKcS1_S1_j+0x0) [0x55971bff3cd8]
|
mongos(_ZN5mongo20ClusterCursorManager13_detachCursorENS_8WithLockERKNS_15NamespaceStringEx+0x218) [0x55971c3a46e8]
|
mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x136) [0x55971c3a4946]
|
mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x4D) [0x55971c3a4b8d]
|
mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x1C6) [0x55971c1477a6]
|
mongos(+0x5E12D6) [0x55971c0d12d6]
|
mongos(_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x76) [0x55971c4fc836]
|
mongos(_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x1F) [0x55971c4f7adf]
|
mongos(+0x633F38) [0x55971c123f38]
|
mongos(+0x634F93) [0x55971c124f93]
|
mongos(_ZN5mongo8Strategy13clientCommandEPNS_16OperationContextERKNS_7MessageE+0x59) [0x55971c125679]
|
mongos(_ZN5mongo23ServiceEntryPointMongos13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x5B5) [0x55971c046145]
|
mongos(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xBA) [0x55971c0629ea]
|
mongos(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x97) [0x55971c05e3a7]
|
mongos(+0x5717E1) [0x55971c0617e1]
|
mongos(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x1A2) [0x55971c4b8bc2]
|
mongos(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x15F) [0x55971c05d20f]
|
mongos(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0xAF5) [0x55971c05f755]
|
mongos(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x23B) [0x55971c06004b]
|
mongos(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x11D) [0x55971c05e42d]
|
mongos(+0x5717E1) [0x55971c0617e1]
|
mongos(+0x9C9125) [0x55971c4b9125]
|
mongos(+0xF85DA4) [0x55971ca75da4]
|
libpthread.so.0(+0x7494) [0x7f6500ecd494]
|
libc.so.6(clone+0x3F) [0x7f6500c0facf]
|
----- END BACKTRACE -----
|
|
|
I read through the code some more with David Storch, and he noticed that the call to std::abs(_pseudoRandom.nextInt32()) here could result in undefined behavior if _pseudoRandom.nextInt32() returns INT_MIN. We suspect that most compilers will do something "reasonable" in this situation, however, and are doubtful that it's the root cause of the bug (though I filed SERVER-36926). We also still have this code on the master branch, but have never seen the bug on versions other than 3.2.
|
|
For posterity, here's a symbolized version of the stack trace posted by Nuno Costa.
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/stacktrace_posix.cpp:171:0: mongo::printStackTrace(std::ostream&)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/signal_handlers_synchronous.cpp:180:0: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/signal_handlers_synchronous.cpp:237:0: mongo::(anonymous namespace)::abruptQuit(int)
|
??:0:0: ??
|
??:0:0: ??
|
??:0:0: ??
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/assert_util.cpp:154:0: mongo::invariantFailed(char const*, char const*, unsigned int)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:472:0: mongo::ClusterCursorManager::detachCursor_inlock(mongo::NamespaceString const&, long long)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:310:0: mongo::ClusterCursorManager::checkInCursor(std::unique_ptr<mongo::ClusterClientCursor, std::default_delete<mongo::ClusterClientCursor> >, mongo::NamespaceString const&, long long, mongo::ClusterCursorManager::CursorState)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_cursor_manager.cpp:131:0: mongo::ClusterCursorManager::PinnedCursor::returnCursor(mongo::ClusterCursorManager::CursorState)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/query/cluster_find.cpp:474:0: mongo::ClusterFind::runGetMore(mongo::OperationContext*, mongo::GetMoreRequest const&)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/commands/cluster_getmore_cmd.cpp:108:0: mongo::(anonymous namespace)::ClusterGetMoreCmd::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/s_only.cpp:128:0: mongo::Command::execCommandClientBasic(mongo::OperationContext*, mongo::Command*, mongo::ClientBasic&, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/s_only.cpp:171:0: mongo::Command::runAgainstRegistered(mongo::OperationContext*, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, int)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/strategy.cpp:237:0: mongo::Strategy::clientCommandOp(mongo::OperationContext*, mongo::Request&)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/request.cpp:109:0: mongo::Request::process(mongo::OperationContext*, int)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/s/server.cpp:200:0: mongo::ShardedMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)
|
/data/mci/389eeb60e2dee304bdc665c6487a1955/src/src/mongo/util/net/message_server_port.cpp:227:0: mongo::PortMessageServer::handleIncomingMsg(void*)
|
??:0:0: ??
|
??:0:0: ??
|
A few more observations:
1) This invariant failure has only happened while we were returning a cursor to the cursor manager via ClusterFind::runGetMore. The other code path which may detach/destroy cursors is from the cursor reaper. It's strange that we've never seen this invariant trigger from that path.
2) At the time we tried to use invalidated_cursors.js, there was a bug (see SERVER-33603) which was preventing it from running getMores on the correct collection. I re-ran the test with the bug fixed but still wasn't able to reproduce the problem.
3) I also made some attempts at running the test with
a) Fewer possible cursor IDs (in case there is some kind of cursor collision happening)
b) The ClusterCursorCleanupJob running more frequently
We can also try to reason about what must have happened based on the invariant failure, though this approach hasn't gotten me very far.
-The invariant is that numDeleted is 1.
-numDeleted is the return value of a call to std::map::erase, which returns the number of elements removed. For a std::map, that is either 0 or 1, so numDeleted must have been 0 (see the sketch after this list).
-The way I see it, there are two options:
(a) _cursorIdPrefixToNamespaceMap is missing elements (it is out of sync with _namespaceToContainerMap)
(b) the cursorPrefix which we attempted to erase here is bogus.
(a) does not seem plausible. _namespaceToContainerMap and _cursorIdPrefixToNamespaceMap each have only one site where an element is added (registerCursor) and one place where an element is deleted (detachCursor_inlock). In both places each map is updated.
I have to spend some more time looking into (b).
I think my remaining steps will be to
-Look into option (b)
-Look into whether auth failures could cause this problem (Many of the logs posted by Nuno Costa had a lot of auth failures soon before the invariant triggered).
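For reference, here is a minimal standalone sketch (not the server's invariant machinery; hypothetical map contents) of the erase() semantics the reasoning above relies on:
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

int main() {
    // Hypothetical stand-in for the map that detachCursor_inlock erases from.
    std::map<uint32_t, std::string> prefixToNamespace;
    prefixToNamespace[0x1234u] = "db.coll";

    // std::map::erase(key) returns the number of elements removed:
    // 1 if the key was present, 0 if it was not.
    auto numDeleted = prefixToNamespace.erase(0x1234u);
    assert(numDeleted == 1);  // analogous to the invariant(numDeleted == 1) check

    // Erasing a key that is missing (option (a)) or bogus (option (b)) removes
    // nothing and returns 0, which is the state that trips the invariant.
    numDeleted = prefixToNamespace.erase(0x9999u);
    assert(numDeleted == 0);
    return 0;
}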
|
|
We got the same issue on a mongos.
If you want additional info, please let me know.
Version: mongodb-linux-x86_64-3.2.4
OS: CentOS Linux release 7.2.1511 (Core)
2017-09-06T00:14:43.579+0900 I - [conn225] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 472
|
2017-09-06T00:14:43.579+0900 I - [conn225]
|
|
***aborting after invariant() failure
|
|
|
2017-09-06T00:14:43.593+0900 F - [conn225] Got signal: 6 (Aborted).
|
|
0xc12fc2 0xc11ee9 0xc126f2 0x7f1488ebe100 0x7f1488b235f7 0x7f1488b24ce8 0xba7c7b 0xb48ee8 0xb49169 0xb49401 0xb4b932 0xb01bd6 0xb545b4 0xb5527d 0xb63c3d 0xb53bd6 0x646115 0xbcd3fd 0x7f1488eb6dc5 0x7f1488be428d
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"812FC2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"811EE9"},{"b":"400000","o":"8126F2"},{"b":"7F1488EAF000","o":"F100"},{"b":"7F1488AEE000","o":"355F7","s":"gsignal"},{"b":"7F1488AEE000","o":"36CE8","s":"abort"},{"b":"400000","o":"7A7C7B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"748EE8","s":"_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx"},{"b":"400000","o":"749169","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"400000","o":"749401","s":"_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"400000","o":"74B932","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"400000","o":"701BD6"},{"b":"400000","o":"7545B4","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE"},{"b":"400000","o":"75527D","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"763C3D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"753BD6","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"},{"b":"400000","o":"246115","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"7CD3FD","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F1488EAF000","o":"7DC5"},{"b":"7F1488AEE000","o":"F628D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.4", "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-327.36.3.el7.x86_64", "version" : "#1 SMP Mon Oct 24 16:09:20 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFC2F889000", "elfType" : 3 }, { "b" : "7F1489AEF000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F14898EB000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F14895E3000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F14892E1000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F14890CB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F1488EAF000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F1488AEE000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F1489CF7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
|
mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xc12fc2]
|
mongos(+0x811EE9) [0xc11ee9]
|
mongos(+0x8126F2) [0xc126f2]
|
libpthread.so.0(+0xF100) [0x7f1488ebe100]
|
libc.so.6(gsignal+0x37) [0x7f1488b235f7]
|
libc.so.6(abort+0x148) [0x7f1488b24ce8]
|
mongos(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0xba7c7b]
|
mongos(_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx+0x1E8) [0xb48ee8]
|
mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x129) [0xb49169]
|
mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x41) [0xb49401]
|
mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x262) [0xb4b932]
|
mongos(+0x701BD6) [0xb01bd6]
|
mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x6E4) [0xb545b4]
|
mongos(_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x2ED) [0xb5527d]
|
mongos(_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE+0x19D) [0xb63c3d]
|
mongos(_ZN5mongo7Request7processEPNS_16OperationContextEi+0x876) [0xb53bd6]
|
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x65) [0x646115]
|
mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x26D) [0xbcd3fd]
|
libpthread.so.0(+0x7DC5) [0x7f1488eb6dc5]
|
libc.so.6(clone+0x6D) [0x7f1488be428d]
|
----- END BACKTRACE -----
|
|
|
|
Resolving this as Cannot Reproduce, since we currently don't have enough information to proceed. Please reopen if you have any other information you can provide.
|
|
I adapted the invalidated_cursors.js test that david.storch mentioned so that it runs continuously in a sharded configuration, but after running it for ~48 hours, it did not produce a failure.
I'm going to return this to the backlog for now, so that we can come back to it when we have more information. That will probably mean waiting for it to reproduce in another context.
Steffen, thanks again for the detailed information and logs. When this reproduces, our team will be looking at whether the workload includes non-query commands that return cursors (like aggregate or listCollections), and at whether a collection may have been dropped around the time of the invariant failure. If you know whether either of those applies to your workload, we'd also be interested in knowing that.
Additionally, any information that you have about what commands or queries were running on mongos at the time of the failure and any general workload information that you have would be a huge help. Thanks!
|
|
Thanks for uploading the logs, steffen. The Query Team has analyzed them, but we were unable to find any useful information there. We have also done a code inspection of the ClusterCursorManager looking for a coding error that is consistent with this symptom. That has been a dead end so far.
We noticed that we have an existing test which attempts to stress concurrent cursor management: invalidated_cursors.js. However, this test has been blacklisted in a sharded configuration for an unrelated reason. The next action on this investigation should be to adapt the test to run sharded, and see if after many runs in our continuous integration loop we can reproduce this behavior.
|
|
Thanks. FYI: so far we don't know how to reproduce this. It's also the first time we've noticed it.
|
|
Hi Steffen,
I've created a secure upload portal for the logs. Files uploaded to this portal are visible only to MongoDB employees investigating this issue and are routinely deleted after some time.
Thank you,
Thomas
|
- 27 Shards and 138 mongos
- I can upload it via private upload
- We don't know
- Both, but more reads
|
|
Hello Steffen,
Thank you for reporting this. To help us investigate this issue, can you provide us with the following information?
- What is the topology of the cluster? How many mongoses and shards?
- Can you provide us with the logs on the mongos leading up to the crash? (We can provide a secure upload site if you would not like to attach the logs to this ticket.)
- This crash occurred while running a getMore. Do you happen to know what namespace (database and collection) this getMore was on? Was this namespace sharded?
- What is the workload like on this cluster? Mostly reads or writes?
Thank you,
The Query Team
|
|
We seem to have just hit the same issue on a mongos:
MongoDB 3.2.10
Linux 4.4.0-71-generic #92~14.04.1-Ubuntu SMP Fri Mar 24 15:22:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] db version v3.2.10
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] allocator: tcmalloc
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] modules: none
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] build environment:
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] distmod: ubuntu1404
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] distarch: x86_64
|
2017-04-27T06:20:02.388+0000 I CONTROL [signalProcessingThread] target_arch: x86_64
|
2017-04-27T08:45:30.824+0000 I - [conn3676] Invariant failure numDeleted == 1 src/mongo/s/query/cluster_cursor_manager.cpp 472
|
2017-04-27T08:45:30.824+0000 I - [conn3676]
|
|
***aborting after invariant() failure
|
|
|
2017-04-27T08:45:30.858+0000 F - [conn3676] Got signal: 6 (Aborted).
|
|
0xc68172 0xc672c9 0xc67ad2 0x7f9bf18c1330 0x7f9bf151fc37 0x7f9bf1523028 0xbec7bb 0xb92118 0xb92399 0xb92631 0xb94b12 0xb4d4d6 0xb9d5f4 0xb9e2bd 0xbac89d 0xb9ca32 0x6b3ad5 0xc13db5 0x7f9bf18b9184 0x7f9bf15e6bed
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"868172","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"8672C9"},{"b":"400000","o":"867AD2"},{"b":"7F9BF18B1000","o":"10330"},{"b":"7F9BF14E9000","o":"36C37","s":"gsignal"},{"b":"7F9BF14E9000","o":"3A028","s":"abort"},{"b":"400000","o":"7EC7BB","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000",
|
"o":"792118","s":"_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx"},{"b":"400000","o":"792399","s":"_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE"},{"b":"400000","o":"792631","s":"_ZN5mongo20ClusterCursorM
|
anager12PinnedCursor12returnCursorENS0_11CursorStateE"},{"b":"400000","o":"794B12","s":"_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE"},{"b":"400000","o":"74D4D6"},{"b":"400000","o":"79D5F4","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BS
|
ONObjBuilderE"},{"b":"400000","o":"79E2BD","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"7AC89D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"79CA32","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"}
|
,{"b":"400000","o":"2B3AD5","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"813DB5","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F9BF18B1000","o":"8184"},{"b":"7F9BF14E9000","o":"FDBED","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.10", "gitVersion" : "7
|
9d9b3ab5ce20f51c272b4411202710a082d0317", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-71-generic", "version" : "#92~14.04.1-Ubuntu SMP Fri Mar 24 15:22:50 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "63B4595F051D57181B7C28225EA834151DD88C94" }, { "b" : "7FFDDD5E5
|
000", "elfType" : 3, "buildId" : "B1EAD09775F74DEB311D03DBEE7B4C3DFA8939B1" }, { "b" : "7F9BF27D3000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "48A664AE6B0B4918A3EB0156C6364C4F084232FD" }, { "b" : "7F9BF23F7000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "6B8997EA892A
|
7FF37AC8CAA8F239D595251889BB" }, { "b" : "7F9BF21EF000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "1EEBA762A6A2C8884D56033EE8CCE79B95CD974D" }, { "b" : "7F9BF1FEB000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F881E59FF88BE4F29A228C8657376B3C325C2C" }, { "b" : "7F9BF1CE5000", "p
|
ath" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1654CB13B1D24ED03F4BDCB51FC7524B9181A771" }, { "b" : "7F9BF1ACF000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F9BF18B1000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "22F9078CFA529CCE1A814A4A1A1C018F169D5652" }, { "b" : "7F9BF14E9000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CA5C6CFE528AF541C3C2C15CEE4B3C74DA4E2FB4" }, { "b" : "7F9BF2A32000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "237E22E5AAC2DDFCD06518F63FD720FE758E6E5B" } ] }}
|
mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xc68172]
|
mongos(+0x8672C9) [0xc672c9]
|
mongos(+0x867AD2) [0xc67ad2]
|
libpthread.so.0(+0x10330) [0x7f9bf18c1330]
|
libc.so.6(gsignal+0x37) [0x7f9bf151fc37]
|
libc.so.6(abort+0x148) [0x7f9bf1523028]
|
mongos(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0xbec7bb]
|
mongos(_ZN5mongo20ClusterCursorManager19detachCursor_inlockERKNS_15NamespaceStringEx+0x1E8) [0xb92118]
|
mongos(_ZN5mongo20ClusterCursorManager13checkInCursorESt10unique_ptrINS_19ClusterClientCursorESt14default_deleteIS2_EERKNS_15NamespaceStringExNS0_11CursorStateE+0x129) [0xb92399]
|
mongos(_ZN5mongo20ClusterCursorManager12PinnedCursor12returnCursorENS0_11CursorStateE+0x41) [0xb92631]
|
mongos(_ZN5mongo11ClusterFind10runGetMoreEPNS_16OperationContextERKNS_14GetMoreRequestE+0x262) [0xb94b12]
|
mongos(+0x74D4D6) [0xb4d4d6]
|
mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x6E4) [0xb9d5f4]
|
mongos(_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x2ED) [0xb9e2bd]
|
mongos(_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE+0x19D) [0xbac89d]
|
mongos(_ZN5mongo7Request7processEPNS_16OperationContextEi+0x672) [0xb9ca32]
|
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x65) [0x6b3ad5]
|
mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0xc13db5]
|
libpthread.so.0(+0x8184) [0x7f9bf18b9184]
|
libc.so.6(clone+0x6D) [0x7f9bf15e6bed]
|
----- END BACKTRACE -----
|
|
|
The original reporter is no longer responsive, so we cannot obtain logs or further information.
|
|
The query team discussed the following possibilities in morning meeting and eliminated them as causes:
- Failure to assign the result of a stdx::lock_guard<stdx::mutex> constructor.
- Logical failure in keeping _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap in sync.
I also looked into and eliminated the following possibilities:
- Exception causing _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap to be out of sync:
In registerCursor(), we insert into _cursorIdPrefixToNamespaceMap before inserting into _namespaceToContainerMap, so we cannot have a namespace in _namespaceToContainerMap whose containerPrefix is not in _cursorIdPrefixToNamespaceMap. Then, when we erase from _cursorIdPrefixToNamespaceMap and _namespaceToContainerMap in detachCursor_inlock, the second erase() is exception-safe (a simplified sketch follows this list).
- Using a moved-from CursorEntryContainer:
The only place where we move a CursorEntryContainer is here. Thus we cannot have any moved-from CursorEntryContainers in _namespaceToContainerMap.
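As a rough sketch of the ordering argument above (hypothetical, heavily simplified types rather than the real ClusterCursorManager, which stores CursorEntryContainers), inserting into the prefix map first means an exception between the two inserts can at worst leave an unused prefix entry behind, never a namespace whose prefix is missing:
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for the two maps discussed above.
struct CursorMaps {
    std::map<uint32_t, std::string> cursorIdPrefixToNamespaceMap;
    std::map<std::string, uint32_t> namespaceToContainerMap;

    void registerCursor(uint32_t containerPrefix, const std::string& nss) {
        // Insert into the prefix map first. If anything throws after this point,
        // we may leave an unused prefix entry behind, but we can never end up
        // with a namespace whose prefix is absent from the prefix map.
        cursorIdPrefixToNamespaceMap[containerPrefix] = nss;
        namespaceToContainerMap[nss] = containerPrefix;
    }

    void detachCursor(const std::string& nss) {
        auto it = namespaceToContainerMap.find(nss);
        if (it == namespaceToContainerMap.end()) {
            return;
        }
        // With a non-throwing comparator, erasing from std::map does not throw,
        // so once the first erase succeeds the second cannot fail and leave the
        // two maps out of sync.
        cursorIdPrefixToNamespaceMap.erase(it->second);
        namespaceToContainerMap.erase(it);
    }
};

int main() {
    CursorMaps maps;
    maps.registerCursor(0x42u, "db.coll");
    maps.detachCursor("db.coll");
    return 0;
}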
|
|
I'm moving this to the query team's backlog, since it's my understanding that we own the cluster cursor manager.
|