[SERVER-26159] mongos crashes: Invariant failure !op->timedOut() src/mongo/executor/network_interface_asio.cpp Created: 19/Sep/16  Updated: 08/Jan/24  Resolved: 21/Sep/16

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: 3.2.10, 3.4.0-rc0

Type: Bug Priority: Blocker - P1
Reporter: Xiaoguang Wang Assignee: Samantha Ritter (Inactive)
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Platforms 2016-09-19, Platforms 2016-10-10

 Description   

We hit bug SERVER-25465. After we upgraded our mongos to 3.2.10-rc0, mongos still crashes:

2016-09-19T21:15:54.082+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to 192.168.20.139:47002
2016-09-19T21:15:54.083+0800 I -        [conn2251] Invariant failure !op->timedOut() src/mongo/executor/network_interface_asio.cpp 237
2016-09-19T21:15:54.084+0800 I -        [conn2251]
 
***aborting after invariant() failure
 
 
2016-09-19T21:15:54.089+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-24-0] Failed to connect to 192.168.20.116:47003 - ExceededTimeLimit: Operation timed out
2016-09-19T21:15:54.093+0800 F -        [conn2251] Got signal: 6 (Aborted).
 
 0xc76992 0xc758c9 0xc760d2 0x38d720f7e0 0x38d6e325e5 0x38d6e33dc5 0xbf817b 0xa15aac 0xa15cae 0xa03c3b 0xa011eb 0xa02a60 0xa02bc9 0xa13867 0xa35bb7 0xb94f5d 0xb95a8d 0xba271b 0xb99392 0xb9f606 0xb5135e 0xba7554 0xba821d 0xbb699d 0xba6952 0x698c85 0xc20a45 0x38d7207aa1 0x38d6ee8aad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"876992","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"8758C9"},{"b":"400000","o":"8760D2"},{"b":"38D7200000","o":"F7E0"},{"b":"38D6E00000","o":"325E5","s":"gsignal"},{"b":"38D6E00000","o":"33DC5","s":"abort"},{"b":"400000","o":"7F817B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"615AAC"},{"b":"400000","o":"615CAE"},{"b":"400000","o":"603C3B","s":"_ZNKSt8functionIFvN5mongo10StatusWithISt10unique_ptrINS0_8executor14ConnectionPool19ConnectionInterfaceENS4_23ConnectionHandleDeleterEEEEEEclES8_"},{"b":"400000","o":"6011EB","s":"_ZN5mongo8executor14ConnectionPool12SpecificPool15fulfillRequestsERSt11unique_lockISt5mutexE"},{"b":"400000","o":"602A60","s":"_ZN5mongo8executor14ConnectionPool12SpecificPool13getConnectionERKNS_11HostAndPortENSt6chrono8durationIlSt5ratioILl1ELl1000EEEESt11unique_lockISt5mutexESt8functionIFvNS_10StatusWithISt10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEE"},{"b":"400000","o":"602BC9","s":"_ZN5mongo8executor14ConnectionPool3getERKNS_11HostAndPortENSt6chrono8durationIlSt5ratioILl1ELl1000EEEESt8functionIFvNS_10StatusWithISt10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEE"},{"b":"400000","o":"613867","s":"_ZN5mongo8executor20NetworkInterfaceASIO12startCommandERKNS0_12TaskExecutor14CallbackHandleERKNS0_20RemoteCommandRequestERKSt8functionIFvRKNS_10StatusWithINS0_21RemoteCommandResponseEEEEE"},{"b":"400000","o":"635BB7","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor21scheduleRemoteCommandERKNS0_20RemoteCommandRequestERKSt8functionIFvRKNS0_12TaskExecutor25RemoteCommandCallbackArgsEEE"},{"b":"400000","o":"794F5D","s":"_ZN5mongo18AsyncResultsMerger22askForNextBatch_inlockEm"},{"b":"400000","o":"795A8D","s":"_ZN5mongo18AsyncResultsMerger9nextEventEv"},{"b":"400000","o":"7A271B","s":"_ZN5mongo16RouterStageMerge4nextEv"},{"b":"400000","o":"799392","s":"_ZN5mongo23ClusterClientCursorImpl4nextEv"},{"b":"400000","o":"79F606","s":"_ZN5mongo11ClusterFind8runQueryEPNS_16OperationContextERKNS_14CanonicalQueryERKNS_21ReadPreferenceSettingEPSt6vectorINS_7BSONObjESaISA_EE"},{"b":"400000","o":"75135E"},{"b":"400000","o":"7A7554","s":"_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE"},{"b":"400000","o":"7A821D","s":"_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi"},{"b":"400000","o":"7B699D","s":"_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE"},{"b":"400000","o":"7A6952","s":"_ZN5mongo7Request7processEPNS_16OperationContextEi"},{"b":"400000","o":"298C85","s":"_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"820A45","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"38D7200000","o":"7AA1"},{"b":"38D6E00000","o":"E8AAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.10-rc0", "gitVersion" : "7d2acd6395ec84beca34718a75371bc11f0c9f60", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "8C57070E980059A0C2F2EBB18EB92FC97271C6B9" }, { "b" : "7FFFA54FF000", "elfType" : 3, "buildId" : "08E42C6C3D2CD1E5D68A43B717C9EB3D310F2DF0" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "7C5A504A21B221F299B1C45B9ED9C2340AEC6AEB" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 
3, "buildId" : "3E83FD82B1AE56360B270189F636423FB3016594" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "86FE5BC1F46B8F8AA9A7A479FF991900DB93F720" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "045D39F19533291EBD72D0EE0247F9D49BE2521E" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "B60EF3FCE5D4D2D8BAD2585D5CAAA1167B35DBFD" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B3BD4C596D72FCBE4607C86FEEC14F47B46D0DCC" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F14F3F5B55FC325FDFA7FB6F22F73CA1C6977379" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "DB04BDF55FE66BADD1081D833C17B4881E0EEDB9" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B7F7FF323B3A4A12310A6285412F01ACE8C74E47" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "7920917F74AFAD0B8CB197CABBE472AF39D94C34" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "C8D01C2839F6950988CE32B4266A8F89C521ACB0" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F030466A000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "DDF6449707FD4C49DDE32A293EEE9AC218BFC460" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x32) [0xc76992]
 mongos(+0x8758C9) [0xc758c9]
 mongos(+0x8760D2) [0xc760d2]
 libpthread.so.0(+0xF7E0) [0x38d720f7e0]
 libc.so.6(gsignal+0x35) [0x38d6e325e5]
 libc.so.6(abort+0x175) [0x38d6e33dc5]
 mongos(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0xbf817b]
 mongos(+0x615AAC) [0xa15aac]
 mongos(+0x615CAE) [0xa15cae]
 mongos(_ZNKSt8functionIFvN5mongo10StatusWithISt10unique_ptrINS0_8executor14ConnectionPool19ConnectionInterfaceENS4_23ConnectionHandleDeleterEEEEEEclES8_+0x4B) [0xa03c3b]
 mongos(_ZN5mongo8executor14ConnectionPool12SpecificPool15fulfillRequestsERSt11unique_lockISt5mutexE+0x2AB) [0xa011eb]
 mongos(_ZN5mongo8executor14ConnectionPool12SpecificPool13getConnectionERKNS_11HostAndPortENSt6chrono8durationIlSt5ratioILl1ELl1000EEEESt11unique_lockISt5mutexESt8functionIFvNS_10StatusWithISt10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEE+0x270) [0xa02a60]
 mongos(_ZN5mongo8executor14ConnectionPool3getERKNS_11HostAndPortENSt6chrono8durationIlSt5ratioILl1ELl1000EEEESt8functionIFvNS_10StatusWithISt10unique_ptrINS1_19ConnectionInterfaceENS1_23ConnectionHandleDeleterEEEEEE+0xE9) [0xa02bc9]
 mongos(_ZN5mongo8executor20NetworkInterfaceASIO12startCommandERKNS0_12TaskExecutor14CallbackHandleERKNS0_20RemoteCommandRequestERKSt8functionIFvRKNS_10StatusWithINS0_21RemoteCommandResponseEEEEE+0x3A7) [0xa13867]
 mongos(_ZN5mongo8executor22ThreadPoolTaskExecutor21scheduleRemoteCommandERKNS0_20RemoteCommandRequestERKSt8functionIFvRKNS0_12TaskExecutor25RemoteCommandCallbackArgsEEE+0x7F7) [0xa35bb7]
 mongos(_ZN5mongo18AsyncResultsMerger22askForNextBatch_inlockEm+0x20D) [0xb94f5d]
 mongos(_ZN5mongo18AsyncResultsMerger9nextEventEv+0x1AD) [0xb95a8d]
 mongos(_ZN5mongo16RouterStageMerge4nextEv+0x4B) [0xba271b]
 mongos(_ZN5mongo23ClusterClientCursorImpl4nextEv+0x132) [0xb99392]
 mongos(_ZN5mongo11ClusterFind8runQueryEPNS_16OperationContextERKNS_14CanonicalQueryERKNS_21ReadPreferenceSettingEPSt6vectorINS_7BSONObjESaISA_EE+0xBD6) [0xb9f606]
 mongos(+0x75135E) [0xb5135e]
 mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x6E4) [0xba7554]
 mongos(_ZN5mongo7Command20runAgainstRegisteredEPNS_16OperationContextEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x2ED) [0xba821d]
 mongos(_ZN5mongo8Strategy15clientCommandOpEPNS_16OperationContextERNS_7RequestE+0x19D) [0xbb699d]
 mongos(_ZN5mongo7Request7processEPNS_16OperationContextEi+0x672) [0xba6952]
 mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x65) [0x698c85]
 mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0xc20a45]
 libpthread.so.0(+0x7AA1) [0x38d7207aa1]
 libc.so.6(clone+0x6D) [0x38d6ee8aad]
-----  END BACKTRACE  -----
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.78:47005
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.186:47001
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.78:47005
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.186:47001
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.116:47003
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.116:47003
2016-09-19T21:15:54.100+0800 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-21-0] Connecting to 192.168.20.118:47004
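
For context on the fatal sequence above (the invariant failure, then "Got signal: 6 (Aborted)", then the backtrace through mongo::invariantFailed and abort): this is what an invariant() check produces when its condition fails. Below is a minimal, hypothetical C++ sketch of that mechanism, not the actual src/mongo/util assert code; only the failing condition !op->timedOut() comes from the log.

    #include <cstdio>
    #include <cstdlib>

    // Simplified stand-in for MongoDB's invariant(): log the failed
    // condition with file and line, then abort() -- which raises SIGABRT
    // (signal 6) and triggers the backtrace seen in the log.
    #define invariant(expr)                                          \
        do {                                                         \
            if (!(expr)) {                                           \
                std::fprintf(stderr, "Invariant failure %s %s %d\n", \
                             #expr, __FILE__, __LINE__);             \
                std::abort();                                        \
            }                                                        \
        } while (0)

    // Hypothetical operation type; in the real code this is the AsyncOp
    // whose timedOut() flag the networking layer checks.
    struct Op {
        bool timedOut() const { return true; }  // force the failing case
    };

    int main() {
        Op o;
        Op* op = &o;
        invariant(!op->timedOut());  // mirrors network_interface_asio.cpp:237
        return 0;
    }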



 Comments   
Comment by Xiaoguang Wang [ 21/Sep/16 ]

Thanks for your quick response and fix.

We decided to disable "move chunk" and "auto split" in our production environment because we have experienced mongo cluster no-response problems at least 3 times recently, and we are sure these problems were caused by "move chunk" and "auto split" under high workload.

Also, this crash (SERVER-26159) only occurs during auto splitting under high workload, which leaves the mongos instances unable to connect to mongod. (Unfortunately, we don't have a chance to re-enable auto splitting in our production environment because we have to keep our service stable for now.)

We suspect the design of "chunk" handling has defects. Can you look into the "move chunk" and "auto split" procedures to prevent the mongo cluster from becoming unresponsive under high workload?

Comment by Samantha Ritter (Inactive) [ 21/Sep/16 ]

Hi wxiaoguang@gmail.com,

Thank you very much for the information and log files. I have found a race condition in our connection code. The race is triggered when communication between cluster members takes about as long as our timeout value, so it makes sense that you encountered it when mongos was stuck trying to refresh. I'm very sorry that this bug has been interfering with your service, especially after SERVER-25465.
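
To illustrate the shape of the race described above, here is a hedged sketch in which a timeout timer and a completion callback fire at nearly the same moment; every name other than timedOut() (taken from the invariant message) is hypothetical, and this is not the actual NetworkInterfaceASIO code.

    #include <atomic>
    #include <cassert>
    #include <chrono>
    #include <thread>

    // Hypothetical async operation with a timeout flag, loosely modeled
    // on the AsyncOp named in the invariant message.
    struct AsyncOp {
        std::atomic<bool> timedOutFlag{false};
        bool timedOut() const { return timedOutFlag.load(); }
    };

    int main() {
        AsyncOp op;

        // Timer thread: marks the op as timed out when the deadline passes.
        std::thread timer([&] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            op.timedOutFlag.store(true);
        });

        // Completion thread: the operation completes at almost exactly the
        // same moment. The completion path asserts the timeout handler
        // cannot have run yet -- analogous to invariant(!op->timedOut()).
        // When the two threads interleave the wrong way, this aborts.
        std::thread completion([&] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            assert(!op.timedOut());
        });

        timer.join();
        completion.join();
        return 0;
    }

Run repeatedly, this program sometimes passes and sometimes aborts, which is the defining property of the race: the outcome depends on which thread wins, and it is likeliest to go wrong when the work takes about as long as the timeout.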

I've merged in the fix for this race condition to both our v3.2 and master branches. This fix will be available in 3.2.10-rc1, which we will release shortly. If you are willing to upgrade to rc1 when it comes out, we are hopeful that it will fix this issue for you. If rc1 does not fix the issue this will continue to be a top priority for us.

I'm going to resolve this ticket. If you continue to see this issue in 3.2.10-rc1, please don't hesitate to open a new ticket so we can address the problem right away.

Thanks,
Sam

Comment by Githook User [ 21/Sep/16 ]

Author:

samantharitter <samantha.ritter@10gen.com>

Message: SERVER-26159 Increment the AsyncOp id immediately so timeout code will not run
Branch: master
https://github.com/mongodb/mongo/commit/9d9ef05168fa45cc16f176f1b6387073f7b32689
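
The commit message suggests a generation-counter style of fix: bump the op's id as soon as the op moves on, so a timeout handler armed against the old id recognizes it is stale and does nothing. A hedged sketch of that general pattern follows; the field and function names are hypothetical and this is not the actual change.

    #include <atomic>
    #include <cstdint>
    #include <iostream>

    // Hypothetical op carrying a generation id, sketching the pattern the
    // commit message describes.
    struct AsyncOp {
        std::atomic<std::uint64_t> id{0};

        // Incrementing the id immediately invalidates any timeout that was
        // armed against the previous generation.
        void retire() { id.fetch_add(1); }
    };

    // The timeout handler captures the id it was armed with and only acts
    // if the op is still on that generation.
    void onTimeout(AsyncOp& op, std::uint64_t armedId) {
        if (op.id.load() != armedId) {
            return;  // stale timeout for an earlier generation: no-op
        }
        std::cout << "op generation " << armedId << " timed out\n";
    }

    int main() {
        AsyncOp op;
        const std::uint64_t armedId = op.id.load();  // timer armed for gen 0

        op.retire();             // op completes/recycles first; id becomes 1
        onTimeout(op, armedId);  // late timer fires, sees stale id, does nothing
        return 0;
    }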

Comment by Kelsey Schubert [ 21/Sep/16 ]

Hi wxiaoguang@gmail.com,

Thank you for answering Sam's questions. Please be aware that I've moved the uploaded files to the secure upload portal that she provided.

Kind regards,
Thomas

Comment by Xiaoguang Wang [ 21/Sep/16 ]

> Are you running a replica set or a sharded cluster? How many nodes?
5 mongos instances
5 shards (1 primary, 2 secondaries, 2 arbiters for every shard)
3 config instances (1 primary, 2 secondaries)
all: WiredTiger engine, XFS file system

Comment by Xiaoguang Wang [ 21/Sep/16 ]

> How frequently have you seen this crash?
Almost every time the server is busy.
Especially once a mongos crashed: even if we restarted it, it would crash again soon because of too many incoming requests.
Our service was down for 2 hours because of this bug. We couldn't even keep mongos running.

> Is it something you are able to reproduce consistently?
> If so, what sort of workload are you running when you encounter the bug?
I think a related case is when mongos is doing auto-splitting.
It causes the config information to become unsynced between the mongos instances.
When the mongos instances try to refresh the config information, mongos blocks and a timeout occurs.
Yesterday we disabled auto-splitting on mongos, and there have been no crashes since.

> Would you please upload an archive of the diagnostic.data directory from the dbpath of a primary node?
> Are you running a replica set or a sharded cluster? How many nodes?
> Would you please upload the complete logs from the crashed server?
I will ask our DBA.

Comment by Githook User [ 20/Sep/16 ]

Author:

samantharitter <samantha.ritter@10gen.com>

Message: SERVER-26159 Increment the AsyncOp id immediately so timeout code will not run
Branch: v3.2
https://github.com/mongodb/mongo/commit/6a29ec259b45f713b74b1fcd2a033918b0983dbb

Comment by Samantha Ritter (Inactive) [ 20/Sep/16 ]

wxiaoguang@gmail.com, as we investigate this, if you are able to give me some information about your environment and workload, that would be very helpful.

  1. How frequently have you seen this crash?
  2. Is it something you are able to reproduce consistently?
  3. If so, what sort of workload are you running when you encounter the bug?
  4. Would you please upload an archive of the diagnostic.data directory from the dbpath of a primary node?
  5. Are you running a replica set or a sharded cluster? How many nodes?
  6. Would you please upload the complete logs from the crashed server?

I've created a secure upload portal for you to use. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thank you,
Sam

Comment by Samantha Ritter (Inactive) [ 20/Sep/16 ]

Hi there, just an update on this ticket. I believe I've identified the root cause of the issue and we're currently reviewing and testing a fix for it.
