[SERVER-28629] router blocks and throws ExceededTimeLimit Created: 04/Apr/17  Updated: 08/Jan/24  Resolved: 22/Jun/17

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Mira Carey
Resolution: Duplicate Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File figure_1.png     JPEG File fr-11_tcpwait.jpg     JPEG File fr-11_tcpwaitOnly.jpg     JPEG File tcp-tw_v3.0.12.jpg     JPEG File tcp_timewait_3.2.12vs3.2.8.jpg     JPEG File v3.2.12_latencies.jpg     JPEG File v3.2.12_tcp_tw.jpg     JPEG File v3.2.8_latencies.jpg     JPEG File v3.2.8_tcp_tw.jpg    
Issue Links:
Duplicate
duplicates SERVER-29237 Add maxConnecting in asio connpool Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

We can reproduce the issue at any time just by executing a findOne through the router several times:

for(x=0;x<1000;x++){db.offer.find({"_id" : NumberLong("5672494983")}).forEach(function(u){printjson(u)});print(x)}

It blocks after only a few findOne calls.
If we execute the same code directly on the shard where the document is located, there is no blocking at all.

Participants:

 Description   

We are using a new sharded cluster running v3.2.12. Our cluster is not operational because many operations get blocked by the router. The corresponding log messages look like this:

2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-066.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out

We observe this behaviour independently of whether the query uses the shard key or not; in all cases the queried field is indexed.

A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set (CSRS) instead of a mirrored set.
An upgrade of the routers to v3.4.3 is not possible because "Version 3.4 mongos instances cannot connect to earlier versions of mongod instances."
https://docs.mongodb.com/manual/release-notes/3.4-compatibility/

Please also see the two attached monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (TCP TIME_WAIT) is very high.
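
For reference, the TIME_WAIT counts shown in those graphs can be sampled directly on a router host with standard tools; a minimal sketch, assuming ss or netstat is available:

# one-shot sample of sockets currently in TIME_WAIT
ss -tn state time-wait | tail -n +2 | wc -l
# or, on hosts without ss:
netstat -ant | grep -c TIME_WAIT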

This ticket is related to SERVER-26722, which was closed as "resolved and fixed in 3.2.12", but since we still have this issue we've created this new ticket for it.



 Comments   
Comment by Mira Carey [ 22/Jun/17 ]

Now that we've picked a release for the backport of SERVER-29237 (for 3.2.15, no date yet, but it's the next one up), I'm marking this ticket as a duplicate of that and closing this out.

While you're welcome to stick with other workarounds if they're stable for you, maxConnecting may also be a viable option. That will depend on whether the initial latency of spinning up new connections is tolerable once we drive down connection churn.

If you have any questions, feel free to re-open or open a new ticket as needed.

Comment by Hyun Gul Roh [X] [ 17/May/17 ]

Thanks, @kay.agahd. If the connections from clients are stable, are all the TIME_WAIT sockets caused by closing the connections to the mongodata* hosts?

I'd also like to ask the MongoDB engineers a question.
According to the analysis above, only one of the 5 mongos instances stalled.
Some of my team members therefore suspect that this issue is caused by abnormal server selection in the Java driver. Is this plausible?
Judging from the logs, there is no evidence that the other mongos instances were busy.

Comment by Kay Agahd [ 16/May/17 ]

BTW, does "the number of connections remained stable" mean that there are no more connections from clients to mongos (the router)?

Yes.
The connections were kept open. If the server blocked for too long, the connection may have been closed on the server or client side. However, connections were rarely blocked that long, though being blocked for even a few seconds is already painful.

Comment by Hyun Gul Roh [X] [ 16/May/17 ]

Thanks, kay.agahd.
BTW, does "the number of connections remained stable" mean that there are no more connections from clients to mongos (the router)?
If a client continuously connects and disconnects, the number of connections from that client can still appear stable.

If no new connections are being made from the clients, our case seems different from this issue.
Rather, analyzing the mongos log file attached at https://jira.mongodb.org/secure/attachment/142286/log-20161012-app0.gz to https://jira.mongodb.org/browse/SERVER-26654, one client alone made more than 1600 connections to mongos.
So I am wondering whether those connections were kept open or not.

Anyway, thank you for your recommendation.

Comment by Kay Agahd [ 16/May/17 ]

Hi Hyun Gul Roh,

when you see TIME_WAIT, can you tell us whether the connections from clients are closed by mongos?

The number of connections remained stable, so I think the connections were not closed.

Can you give us any advice on changing MongoDB?

We are using mongos v3.2.12 with the following added parameters:

setParameter=ShardingTaskExecutorPoolHostTimeoutMS=3600000
setParameter=ShardingTaskExecutorPoolMinSize=100


Since then, no crashes, no blocking. All seems to be fine so far.
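
For reference, one way to confirm that such parameters are actually in effect on a running mongos is to read them back with getParameter; a minimal sketch, assuming the mongos is reachable locally on the default port and authentication is handled separately:

mongo --port 27017 --eval 'printjson(db.adminCommand({getParameter: 1, ShardingTaskExecutorPoolHostTimeoutMS: 1, ShardingTaskExecutorPoolMinSize: 1}))'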

Comment by Hyun Gul Roh [X] [ 16/May/17 ]

Hi, all.

We've hit a very similar issue on our MongoDB cluster.
Here is our analysis, based on the logs we have.
At the end of the analysis we also list some questions.

  • SETTINGS
    • 40 clients using mongo-java-driver v3.2.0 with the default configuration
    • 5 mongos machines running v3.2.11
    • 6 primary mongod v3.2.11 + 6 secondary mongod v3.2.11
    • deployed as CSRS
    • OS: CentOS 6.7, kernel 2.6.32-642.13.1.el6.x86_64 #1 SMP
    • glibc 2.12
  • SYMPTOM
    • On 24 Apr 2017, between 19:55:30 and 19:55:56, two clients crashed due to a kernel panic.
    • From 19:59:30 on 24 Apr 2017, clients received com.mongodb.MongoExecutionTimeoutException for 9m 13s.
    • Let us call the 5 mongos machines mongos01, mongos02, ..., mongos05.
    • Only mongos05 started logging "Failed to connect to mongodata* - ExceededTimeLimit: Operation timed out", from 19:59:29 to 20:08:35 on 24 Apr 2017, as shown in the following histogram, and afterwards returned to a normal state without our intervention.

fig1

  • The following are histograms of the "connection accepted from XXXX #YYY (ZZZZ connections now open)" log lines on mongos01~mongos05.

fig2

  • You can see that mongos05 received far more client connections between 19:59:04 and 19:59:39 than the other mongos instances.
  • Looking into the "connection accepted ..." logs on mongos05, the number of connections each client made to mongos05 is as follows, in descending order:

fig3

  • mongos05 has only 5 "end connection ..." log lines, at 20:05:58, 20:06:12, 20:06:48, 20:06:48 and 20:07:23.
  • When the issue occurred, mongos01~mongos05 had the following network traffic, which is rather lower than in the normal state.

fig4

  • The mongos instances made connections to the 6 mongod hosts as shown in the following histogram, according to the "Successfully connected to mongodata* ..." logs:

fig5

  • The following is the TIME_WAIT count on mongos05, which means that mongos05 was closing connections:

fig6

  • One of the primary mongod hosts has "connection accepted from mongos05" logs as shown in the following histogram. (The other mongod machines show a similar pattern.)

fig7

  • The same mongod had "end connection mongos05" logs as shown in the following histogram:

fig8

  • The same mongod had the following counts of "end connection" logs for each mongos machine, which shows that the connections from mongos05 were being closed continuously:

mongos02 690
mongos05 20205
mongos04 881
mongos03 513
mongos01 648

  • QUESTIONS
    1. Some of our team members suspect that the MongoDB Java drivers on the clients are causing this problem, so they are suggesting changing the driver configuration, for example the server selection options. Is this a good idea, and if so, can you recommend a configuration?
    2. Looking at the TIME_WAIT graph, our mongos05 was actively closing connections. @kay.agahd, when you see TIME_WAIT, can you tell us whether the connections from clients are closed by mongos?
    3. Our system managers don't want to upgrade the glibc version because they ran into fatal side effects when changing it on other systems. Also, due to https://jira.mongodb.org/browse/SERVER-24386, our DB manager is hesitant to downgrade to 3.2.8. So, to avoid this issue, we have to decide on something. Can you give us any advice on changing MongoDB?
Comment by Jon Hyman [ 11/May/17 ]

To follow up on Wayne's post: when this is happening, the mongos by and large stops responding to queries and we get operation failures with "Couldn't get a connection within the time limit (50)", which effectively takes down our application. What other potential options do we have for mitigation? Here are our kernel and glibc versions, so we don't think it's that.

  1. uname -r
    3.13.0-85-generic
  2. ldd --version
    ldd (Ubuntu EGLIBC 2.19-0ubuntu6.7) 2.19
Comment by Wayne Egerer [ 11/May/17 ]

Hey @Jason Carey,
We attempted the above suggestions to fix the same issue we are experiencing, though the problem still occurs, typically within 5 minutes of applying the changes. So it's happening both with and without the changes.

We are running mongos v3.2.12

The cluster is busy, so we suspect that is why we still see the issue with the recommended config changes while others may not.

2017-05-09T18:13:21.762+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to c0-r9-data-001.mongo:27017 - ExceededTimeLimit: Operation timed out
2017-05-09T18:13:21.762+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to c0-r0-data-003.mongo:27017 - ExceededTimeLimit: Operation timed out
2017-05-09T18:13:21.762+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to c0-r0-data-003.mongo:27017 - ExceededTimeLimit: Operation timed out

Comment by Kay Agahd [ 08/May/17 ]

Hi mira.carey@mongodb.com, we followed your suggestion and tried two different configurations, one at a time, while running mongos v3.2.12.
Config 1: we multiplied the default values by 12, resulting in:

setParameter=ShardingTaskExecutorPoolHostTimeoutMS=3600000
setParameter=ShardingTaskExecutorPoolRefreshRequirementMS=720000

After 12 minutes of runtime, queries got blocked and TCP TIME_WAIT increased, so we had to stop this mongos and try the next config:

Config 2: we multiplied the default value of ShardingTaskExecutorPoolHostTimeoutMS by 12 and the default value of ShardingTaskExecutorPoolMinSize by 100, resulting in:

setParameter=ShardingTaskExecutorPoolHostTimeoutMS=3600000
setParameter=ShardingTaskExecutorPoolMinSize=100

This mongos ran for more than an hour without blocking, and no TCP TIME_WAIT spikes were observed, so the second config looks like a viable workaround.

We could also try setting the values of the first config extremely high, say to a year or so, so that the spikes would only occur after a year unless we restarted the router in the meantime (which normally happens earlier anyway).

Comment by Antonis Giannopoulos [ 08/May/17 ]

Hi,

We are having the same issue (a more detailed description is in SERVER-28232) and have deployed one of the workarounds in production (version 3.2.12):

ShardingTaskExecutorPoolHostTimeoutMS: 300000000
ShardingTaskExecutorPoolMinSize: 100

The ShardingTaskExecutorPoolMinSize value came from the fact that the shards were maintaining around 2000 connections prior to the change. We are running 4 mongos instances and each mongos has 6 CPUs (so 6 connection pools), so I expect at least 4*6*100 = 2400 connections, which is close to what the cluster needs.

There is no particular reason why we picked ShardingTaskExecutorPoolHostTimeoutMS=300000000; it was the only value that didn't produce the exception during the benchmarks in our test environment. As the tests continue, we might be able to determine a lower value for it.

So far I haven't noticed any "Failed to connect to ... - ExceededTimeLimit: Operation timed out" in the logs, but it's only been 4 hours since we made the change. I will have more feedback after 24-48 hours and once the ShardingTaskExecutorPoolHostTimeoutMS expires.

Ant

Comment by Kay Agahd [ 05/May/17 ]

Hi mira.carey@mongodb.com, thank you for the helpful information. We will test your suggested configuration next week and report back.

For your information: we can also reproduce the issue with mongos v3.2.9. Since v3.2.8 does not show the problem, this means the bug must have been introduced in v3.2.9.

Comment by Mira Carey [ 04/May/17 ]

kay.agahd@idealo.de,

If you're having success with 3.2.8, it's likely that you're benefiting from the connection pool never shrinking and never heartbeating (forcing a task to wait while the heartbeat occurs), and therefore never experiencing the cost of creating a new connection to a loaded mongod.

You can replicate that behavior with a set of setParameters that were made available starting in 3.2.11 (via SERVER-25027):

ShardingTaskExecutorPoolHostTimeoutMS

  • This controls how long we'll keep connections to a host around without any traffic.
  • It defaults to 5 minutes

ShardingTaskExecutorPoolRefreshRequirementMS

  • This controls how long we'll wait to heartbeat a connection in the pool (when one is idle)
  • It defaults to 1 minute

ShardingTaskExecutorPoolMinSize

  • This controls the minimum number of connections to maintain to a host
  • It defaults to 1

Setting the host timeout and refresh requirement to very high values will effectively reproduce the behavior of the connection pooling in 3.2.8.

Another option would be to increase the host timeout and min pool size. This would limit connection churn from spiky traffic by keeping connections around through lulls.
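
For illustration, either variant comes down to passing these setParameters when starting mongos. A minimal sketch only; the config replica set name, hostnames and values below are placeholders and would need to be tuned for the actual workload:

# Illustrative only: keep pooled connections around much longer than the
# defaults and maintain a floor of warm connections per host.
mongos --configdb csReplSet/cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 \
       --setParameter ShardingTaskExecutorPoolHostTimeoutMS=3600000 \
       --setParameter ShardingTaskExecutorPoolRefreshRequirementMS=720000 \
       --setParameter ShardingTaskExecutorPoolMinSize=100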

Comment by Kay Agahd [ 19/Apr/17 ]

Hi acm, I forgot to confirm that you're right about your definitions of the "bad" and "good" clusters given on Apr 13 2017 08:48:23.

Comment by Kay Agahd [ 19/Apr/17 ]

Hi acm, I'm on sick leave until the end of this week. My colleague may follow up if his time allows.
To answer your questions nevertheless:
Turning up a 3.2.10 mongos in the "good" cluster will result in ExceededTimeLimit errors. I reported this in SERVER-26722; that was the reason we replaced the v3.2.10 routers with v3.0.12 at the time. In the meantime we've found that mongos v3.2.8 routers don't throw ExceededTimeLimit errors, so we upgraded the v3.0.12 routers in our "good" cluster to mongos v3.2.8. We are aware that v3.2.8 may crash due to a bug resolved in v3.2.10, but we are able to work around that by setting mongos maxConns to 2500. This may result in refused connections during connection spikes, but at least the routers neither crash nor block anymore.
If time allows, my colleague will turn up a 3.2.9 mongos in the "bad" cluster and report back.

Comment by Andrew Morrow (Inactive) [ 18/Apr/17 ]

Hi kay.agahd@idealo.de - I'm following up on this ticket because I haven't heard anything back from you in a few days. Do you have any further information regarding the questions I asked last week? Will you be able to perform any of the experiments I suggested (I understand that you might not be able to)? I'll be working again today to try to reproduce your issue, so any additional information you can provide would be valuable.

Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ]

kay.agahd@idealo.de - So far, I have not been able to reproduce the behavior you have observed, though I will keep trying. In an effort to continue to gather data, would you be in a position to run either or both of the following experiments:

  • In the "good" cluster, turn up a 3.2.10 mongos, and see if you can reproduce the ExceededTimeLimit errors when communicating with that mongos. I realize that this is likely your production cluster so I understand my you may be reluctant or unable to do so.
  • In the "bad" cluster, turn up a 3.2.9 mongos, and see if you can reproduce with that.

The first test will let us see if this issue reproduces in both SCCC and CSRS environments, and the second will bisect some of the range over which this apparent regression exists.

Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ]

kay.agahd@idealo.de - I overlooked that you had downgraded mongos from 3.2.12 to 3.2.8 in the cluster I named as "Bad" above. Presumably, the "Bad" cluster is actually working fine for you at that version, and you only reproduce the ExceededTimeLimit errors if you advance mongos to 3.2.10 or 3.2.12.

Comment by Andrew Morrow (Inactive) [ 13/Apr/17 ]

kay.agahd@idealo.de - Thank you for providing that. I'd like to clarify/confirm one thing. Am I correct that in both the "good" cluster and the "bad" cluster, all of the mongos instances are running 3.2.8? The differences between the clusters (ignoring, for the moment, sizes, hardware variations, VMs, etc.), based on your notes above, appear to be:

The "Good" cluster:

  • Using SCCC.
  • Versions of mongod in the replica sets are mixed between 3.2.10 and 3.2.11.
  • Versions of mongos are 3.2.8.
  • Three config servers (SCCC) at 3.2.10.

The "Bad" cluster:

  • Using CSRS.
  • Version of mongod in the shard replica sets is 3.2.12.
  • Version of mongod in the CSRS is 3.2.12.
  • Versions of mongos are 3.2.8.

In other words, the version of mongos does not vary between the good cluster and the bad cluster, but the version of mongod does. Is that correct, or have I misinterpreted your above writeup somehow?

Comment by Kay Agahd [ 12/Apr/17 ]

Hi acm, thank you for your quick reply. I'm happy to help to get this bug fixed.

  • What is the cluster topology for the 3.2.8 run (number of shards, number of replicas in each shard, number of mongos, etc.)?
    4 shards, each with 3 replica set members; each member is a bare-metal machine with 128 GB RAM and a 750 GB SSD; the number of CPU cores differs (24, 48 and 56), as does the CPU frequency (1700 and 2400 MHz); the mongod versions are mixed too (3.2.10 / git version: 79d9b3ab5ce20f51c272b4411202710a082d0317 and 3.2.11 / git version: 009580ad490190ba33d1c6253ebd8d91808923e4)
    6 mongos instances, all running v3.2.8 / git version: ed70e33130c977bda0024c125b56d159573dbaf0 in a VM, Debian 7 64-bit, each with 8 GB RAM, 20 GB SSD and 4 CPU cores
    3 config servers (SCCC), v3.2.10 / git version: 79d9b3ab5ce20f51c272b4411202710a082d0317 in a VM, Debian 7 64-bit, each with 4 GB RAM, 20 GB SSD and 2 CPU cores
  • What is the cluster topology for the 3.2.12 run, if different from above?
    3 shards, each with 3 replica set members; each member is a bare-metal machine with 128 GB RAM, a 750 GB SSD and 56 CPU cores; the mongod version is v3.2.12 / git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
    3 mongos instances, all running v3.2.8 / git version: ed70e33130c977bda0024c125b56d159573dbaf0 in a VM, Debian 7 64-bit, each with 8 GB RAM, 20 GB SSD and 4 CPU cores
    3 config servers (CSRS), mongod version v3.2.12 / git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14, running on the same hosts as the mongos routers
  • Were the 3.2.8 and 3.2.12 runs executed on the same systems? Or is the 3.2.12 environment different? If so, how?
    They are physically different systems; see their specs above.
  • How is the 'offers' collection configured (re sharding, indexes, etc.)?
    The offer collection is sharded on _id; in the old cluster _id is a NumberLong and in the new cluster it is a hashed String (see example below). Each document is typically between 2000 and 2500 bytes (see example below), there are 13 indexes (see index stats below), and the collection holds about 215 million documents. There are no other collections or databases (besides local, config and admin).
  • What does a document in the 'offers' collection look like, typically?
    see below

All our MongoDB clusters are running with keyfile authentication and the WiredTiger storage engine.

Our cluster is in MMS. Our group name is idealo and the cluster name is offerStoreFR - but MMS wrongly shows the new and the old cluster as a single cluster consisting of 7 shards. The old cluster (running mixed mongod v3.2.10 and v3.2.11 with mongos v3.2.8, currently in production) consists of the following shards:

  • offerStoreFR
  • offerStoreFR2
  • offerStoreFR3
  • offerStoreFR4

The new cluster (running mongod v3.2.12 and, at the time, mongos v3.2.12, since downgraded to v3.2.8; not yet in production because it is still syncing from the old cluster) consists of the following shards:

  • offerStoreFR01
  • offerStoreFR02
  • offerStoreFR03

Also, MMS wrongly shows only the 3 routers and config servers of the new cluster. I've just discovered that MMS shows the routers and config servers of the old cluster under the cluster name "Cluster_4".

Example offer document, sensitive content replaced by X:

{
        "_id" : "000003d4ae4e37667ce203e083c43320",
        "XXX" : {
                "XXX" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        },
        "XXX" : NumberLong(1140),
        "XXX" : ISODate("2017-03-10T03:44:15Z"),
        "XXX" : "EUR",
        "XXX" : {
                "0" : "XXX"
        },
        "XXX" : {
                "0" : "XXX"
        },
        "XXX" : NumberLong(3751),
        "XXX" : {
                "0" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        },
        "XXX" : "XXXXX",
        "XXX" : "XXXXXXXXXXXXXXXXXXXXXXXX",
        "XXX" : null,
        "XXX" : "XXXXX",
        "XXXXXX" : {
                "0" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        },
        "XXXXXXX" : ISODate("2017-04-02T05:24:20Z"),
        "XXXXX" : {
                "0" : "XXXX"
        },
        "XXXX" : "XXXX",
        "XXXX" : NumberLong(112),
        "XXXX" : false,
        "XXXX" : [
                "XXXXXXXXXXXXXXXX"
        ],
        "XXX" : "XXXXXXXXXXXXXXXXX",
        "XXXXX" : "XXXXXXXXXXXXXXXXXXXXX",
        "XXXXXX" : "XXXXXXXXXXXXXXXXXXXXXXX",
        "XXXX" : {
                "0" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        },
        "XXXX" : {
                "0" : 23.9
        },
        "XXXXX" : "XXXXXXXXXXXXXXXX",
        "XXXXXX" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "XXXXX" : NumberLong(10794),
        "XXXXXXXXXXX" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
        "XXXXXX" : NumberLong(1255380132),
        "XXXXXXXXXXXXX" : NumberLong(10794),
        "XXXXXXXXXXXXXXXXX" : {
                "0" : [
                        "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
                ]
        },
        "XXXXXXXXXXXXXXX" : {
                "0" : {
                        "XXX" : 0
                }
        },
        "XXXXXXXXXXXXXXXXXXXXXXXXXX" : {
                "0" : "XXX"
        },
        "XXX" : {
                "0" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        },
        "XXXXXXXXXXXXXXXXXXX" : {
                "21781" : [
                        NumberLong(82834),
                        NumberLong(82830)
                ]
        },
        "XXXX" : NumberLong(281836),
        "XXXXXXXXXX" : ISODate("2017-04-02T05:21:51Z"),
        "XXXXXXXX" : "000003d4ae4e37667ce203e083c43320",
        "XXXXX" : NumberLong(1)
}

Index stats in bytes of the offer collection for only 1 of the 4 shards, sensitive content replaced by XXX:

 "nindexes" : 13,
"totalIndexSize" : 13121323008,
"indexSizes" : {
		"_id_" : 2464993280,
		"XXX_1" : 1105694720,
		"XXX_1" : 2987773952,
		"XXX_1" : 356233216,
		"XXX_1" : 394670080,
		"XXX_1" : 425984,
		"XXX_1_XXX_1" : 579452928,
		"XXX_1_XXX_1_XXX_1" : 416321536,
		"XXX_1" : 239517696,
		"XXX_1" : 3454382080,
		"XXX_1" : 4096,
		"XXX_1_XXX_1_XXX_1" : 757325824,
		"XXX_1_XXX_1" : 364527616
},


Stats in Bytes of offer collection for all 4 shards:

"sharded" : true,
"capped" : false,
"ns" : "offerStore.offer",
"count" : 217107142,
"size" : 506783746588,
"storageSize" : 253317558272,
"totalIndexSize" : 44255559680,
"indexSizes" : {
		"_id_" : 3282075648,
		"XXX_1" : 1052590080,
		"XXX_1" : 10061955072,
		"XXX_1" : 1385840640,
		"XXX_1" : 1140592640,
		"XXX_1_XXX_1_XXX_1" : 1617039360,
		"XXX_1" : 1813213184,
		"XXX_1_XXX_1" : 2571128832,
		"XXX_1" : 3212742656,
		"XXX_1" : 13279195136,
		"XXX_1_XXX_1" : 1718837248,
		"XXX_1_XXX_1_XXX_1" : 3120332800,
		"XXX_1" : 16384
},
"avgObjSize" : 2334.2564501539982,
"nindexes" : 13,
"nchunks" : 48957,

Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ]

kay.agahd@idealo.de - Another potentially interesting bit of information would be whether your reproducer above exhibits the high latencies with 3.2.9, as that would help narrow the search space from 3.2.8..3.2.10 to one of 3.2.8..3.2.9 or 3.2.9..3.2.10, even before I am able to get a local reproduction.

Comment by Andrew Morrow (Inactive) [ 12/Apr/17 ]

kay.agahd@idealo.de - Thank you for the detailed writeup. I am going to attempt to reproduce what you are seeing on my side. If I am able to do so, it should be fairly easy to do a bisect between 3.2.8 and 3.2.12 to understand what change has caused this performance regression.

In an effort to produce as faithful a reproduction environment as possible, could you please provide some additional information? I do apologize if this has been asked and answered elsewhere already, but as there is a somewhat muddled ticket history here, I think it would be good to ensure that this information is all captured in one place:

  • What is the cluster topology for the 3.2.8 run (number of shards, number of replicas in each shard, number of mongos, etc.)?
  • What is the cluster topology for the 3.2.12 run, if different from above?
  • Were the 3.2.8 and 3.2.12 runs executed on the same systems? Or is the 3.2.12 environment different? If so, how?
  • How is the 'offers' collection configured (re sharding, indexes, etc.)?
  • What does a document in the 'offers' collection look like, typically?
Comment by Kay Agahd [ 12/Apr/17 ]

Hi acm, to answer your question: yes, the cluster behavior with respect to ExceededTimeLimit errors in 3.2.12 is identical to what we observed in 3.2.10. This is true from a server-side point of view, because not only are the error messages identical, the abnormally high TCP TIME_WAIT spikes are identical as well. It's also true from a client-side point of view, because execution times of find queries are abnormally high or the queries even time out.

We found out that mongos v3.2.8 does not have this issue. We know that v3.2.8 may crash due to SERVER-25465, but we could avoid that by setting maxConns to 3000.
I just made a quick test to compare mongos v3.2.12 with v3.2.8. The test consists of finding the same document 10000 times in a loop and prints the maximum latency and the total duration of the test. I repeated the test 3 times for each mongos version.
Here is v3.2.12:

2017-04-12T16:00:10.363+0200 I NETWORK  [thread1] trying reconnect to offerstore-fr-router-10.db00.pro05.eu.idealo.com:27017 (10.135.128.209) failed
2017-04-12T16:00:10.385+0200 I NETWORK  [thread1] reconnect offerstore-fr-router-10.db00.pro05.eu.idealo.com:27017 (10.135.128.209) ok
mongos> db.version()
3.2.12
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 11
new max: 12
new max: 14
new max: 193
ms min: 10 max: 193 total: 108896
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 18
new max: 19
new max: 24
new max: 25
ms min: 10 max: 25 total: 108573
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 10
new max: 11
new max: 12
new max: 14
new max: 18
new max: 25
new max: 223
ms min: 10 max: 223 total: 108521


Here is v3.2.8:

2017-04-12T16:07:09.692+0200 I NETWORK  [thread1] trying reconnect to offerstore-fr-router-10.db00.pro05.eu.idealo.com:27017 (10.135.128.209) failed
2017-04-12T16:07:09.716+0200 I NETWORK  [thread1] reconnect offerstore-fr-router-10.db00.pro05.eu.idealo.com:27017 (10.135.128.209) ok
mongos> db.version()
3.2.8
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 11
new max: 16
new max: 17
new max: 24
ms min: 10 max: 24 total: 108825
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 11
new max: 12
new max: 13
new max: 14
new max: 22
new max: 23
new max: 24
ms min: 10 max: 24 total: 108742
mongos> var totalStart=new Date();var start=new Date(); var end=new Date();var max=0;var min=999999;for(x=0;x<10000;x++){start=new Date();db.offer.find({"_id" : "000003d4ae4e37667ce203e083c43320"},{_id:1}).forEach(function(u){end = new Date();var dur=end.getTime()-start.getTime();if(dur>max){max=dur;print("new max: " + max)};if(dur<min){min=dur};})};print("ms min: " + min + " max: " + max + " total: " + ((new Date()).getTime()-totalStart.getTime()))
new max: 12
new max: 17
new max: 23
new max: 25
ms min: 10 max: 25 total: 108225
mongos>


As you can see, the latencies of mongos v3.2.8 never exceed 25 ms, whereas mongos v3.2.12 may reach latencies of over 200 ms.
Please also see the screenshot tcp_timewait_3.2.12vs3.2.8.jpg, which clearly shows that the TCP TIME_WAIT spikes were much higher (nearly 500) for v3.2.12 than for v3.2.8 (max 72). You can see the timestamps on the x-axis and correlate them with the timestamps in my tests where the shell reconnected due to the mongos restart.

Right now, we are using mongos v3.2.8 in production. The difference from mongos v3.2.12 is more than obvious: please compare the two TCP TIME_WAIT screenshots, v3.2.8_tcp_tw.jpg versus v3.2.12_tcp_tw.jpg. These are screenshots of our production system. The maximum for v3.2.8 is 4, whereas the maximum for v3.2.12 is 6000!
The difference in latencies from the client-side point of view is also dramatic: v3.2.8_latencies.jpg shows findOne latencies of at most 5 ms, whereas v3.2.12_latencies.jpg shows latencies of over 3000 ms. These latencies are the reason mongos v3.2.12 is unusable in production.

Comment by Kay Agahd [ 11/Apr/17 ]

Hi anonymous.user,

here are the values you've asked for:

[21:28:05]root@offerstore-fr-router-11.db00.pro05 /home/admin# uname -rv
3.2.0-4-amd64 #1 SMP Debian 3.2.60-1+deb7u3
[21:28:18]root@offerstore-fr-router-11.db00.pro05 /home/admin# ldd --version
ldd (Debian EGLIBC 2.13-38+deb7u11) 2.13
Copyright © 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
[21:28:25]root@offerstore-fr-router-11.db00.pro05 /home/admin#

Thank you.
PS: acm, I'll answer you tomorrow if time allows.

Comment by Kelsey Schubert [ 11/Apr/17 ]

Hi kay.agahd@idealo.de,

I understand you are using debian71; would you please provide the following output so we can rule out some theories that only affect specific versions of glibc?

  • uname -rv
  • ldd --version

Thank you,
Thomas

Comment by Andrew Morrow (Inactive) [ 11/Apr/17 ]

Hi kay.agahd@idealo.de -

I'll be filling in for Jason for a little while as he is currently unavailable. I'd like to take a step back from the prior discussion and ask a higher level question.

Previously, you had filed SERVER-26722, with the same title as this ticket. That ticket was closed as a duplicate of SERVER-27232, and the penultimate message mentions that SERVER-27232 had been fixed; that fix was later backported to 3.2, and was released in 3.2.12. You tried 3.2.12, continued to get connection ExceededTimeLimit errors, and opened this ticket after noting that SERVER-26722 had been closed.

Concurrently with the above, SERVER-26859 was in progress, and also affected 3.2.10. That ticket described a situation where threads from the asynchronous networking layer were being erroneously blocked in a callback, leading to ExceededTimeLimit connection errors. Your prior ticket, SERVER-26722, is in fact linked as related to SERVER-26859.

The fix for SERVER-26859 addressed one specific instance of an asynchronous networking thread being illegally blocked in callback. But what I'm very curious to know is whether the overall cluster mis-behavior and syndrome reported in this ticket (with 3.2.12), is in a general way observably distinct from the behavior observed in SERVER-26722 (with 3.2.10). Or is the cluster behavior with respect to ExceededTimeLimit errors in 3.2.12 generally identical to what you observed in 3.2.10?

I ask all this because if the type of issue fixed in SERVER-26859 leads to the symptoms observed in SERVER-26722, and if, after fixing SERVER-26722, the symptoms persist identically, it seems necessary to consider whether other issues of the same class as SERVER-26859 may still exist in 3.2.12.

I have asked the engineers who worked on SERVER-26859 to provide me with some additional context on how the issue was discovered and diagnosed. If you have any thoughts on whether this hypothesis aligns with your understanding of the timeline and observed cluster behavior, that information would be most welcome.

Comment by Kay Agahd [ 09/Apr/17 ]

Thanks antogiann for the suggestion. We tried it out but it did not solve the problem. We still had ExceededTimeLimit errors.

mira.carey@mongodb.com, you mean that you're interested in seeing what our mongos traffic on v3.2 looks like. Well, since the v3.2 mongos routers are simply not operational, we can't use them in production, so we can't give you more production logs. However, we could give you the logs from our mixed cluster (mongod v3.2.10 and v3.2.11, SCCC config servers and mongos v3.0.12).
We could also set up a pure v3.2.12 test cluster (CSRS config servers), run dummy queries against it and send you the log files.
Which of the two would be more helpful for you? Is the default log level enough for this?

Comment by Antonis Giannopoulos [ 09/Apr/17 ]

On the instance where we managed to reproduce SERVER-28232, the issue went away with the following settings:

ShardingTaskExecutorRefreshTimeoutMS = 60 seconds
ShardingTaskExecutorPoolMaxSize = 1

Unfortunately, in production the same settings didn't have the desired result. I assume it is not related to the load on the shards, as my benchmark stresses the instance with 10x more traffic than production.

Comment by Mira Carey [ 05/Apr/17 ]

kay.agahd@idealo.de,

Responding inline below:

Your observation of client spikes from 15:17 to 15:36 is expected because at this time we moved the client application from accessing the old MongoDB cluster to the new one. In general, our mongos instances have around 2k-3k connections each. Some mongos routers running MongoDB v3.0.12 already had spikes of up to 6k connections without any ExceededTimeLimit exception. Please see our TCP TIME_WAIT screenshot "tcp-tw_v3.0.12.jpg" of a mongos router running v3.0.12 which handles even more connections than the blocking router running v3.2.12, for which we posted the screenshot "fr-11_tcpwaitOnly.jpg" yesterday.
Right now, since the new MongoDB v3.2.12 cluster is not operational, we moved our client application back to the old cluster, which does not show any of the blocking behaviour seen on the new v3.2.12 cluster.

Thanks for the context, that makes sense.

As you can see in "Steps to reproduce" above, the query which may block by throwing ExceededTimeLimit errors in v3.2.12 is querying the _id field, which is also the shardkey. We can reproduce the issue by using another indexed, unique field, called "offerId".

I'd be interested in seeing what your mongos logs look like when that's the only traffic in your 3.2 cluster. What I see in your attached logs is a lot of parallel clients, which is a different (and also important to us) issue.

We strongly believe this issue is tied to MongoDB routers running v3.2.x, because we run 6 different MongoDB clusters on over 100 bare-metal machines, all with the same workload, but only the clusters using v3.2 mongos routers encounter ExceededTimeLimit errors. This makes it impossible for us to use MongoDB v3.2 in production.
Thanks for your investigations to fix this bug!

I don't disagree with you; you're definitely running into a different operational profile between 3.0 and 3.2. My experience with the 3.2 mongos has been that it's much more sensitive to spikes in demand than 3.0. Where 3.0 used a few more threads and had a few more locks, 3.2 tends to jump on a request and immediately spin up a lot of connections. Most of the pain I've seen in 3.2 deployments has been in systems that are stable when load is regular and even, but overwhelm particular shards when load spikes.

In looking at your logs, I'm trying to figure out whether the ExceededTimeLimit errors you're seeing are real and correct (in which case the problem lies in how mongos spools up), or whether they're false (firing too early, logging incorrectly, etc.). Without expanded logging, I have to make that inference based on client load and whether successful connects are mixed in. In this case, you do have load, you do have successful connects mixed in, and there's nothing overly suspicious in there. That inclines me to suggest some of the operational workarounds that have worked for other users.

For the other vein, we've backported SERVER-28259 to 3.2, although I don't currently have a timeline on if/when we'll do a 3.2.13. That adds some logging on how long connections are actually taking to establish, which would go a long way towards validating my operational hypothesis.

Comment by Kay Agahd [ 05/Apr/17 ]

Hi mira.carey@mongodb.com, thank you for your detailed analysis! Let me answer your questions and explain our workload.
The cluster running MongoDB v3.2.12 is a new, freshly installed cluster, which is syncing via mongo-connector (using our self-written doc manager) from another MongoDB cluster whose routers run v3.0.12. The mongod's of the old cluster are running v3.2.11, but the mongos routers had to be downgraded to v3.0.12 in order to avoid the ExceededTimeLimit errors I already described in SERVER-26722. Since our access patterns have changed over the last months, we need to change our shard key. This is the reason why we installed a new cluster and keep it in sync using mongo-connector.

Your observation of client spikes from 15:17 to 15:36 is expected because at this time we moved the client application from accessing the old MongoDB cluster to the new one. In general, our mongos instances have around 2k-3k connections each. Some mongos routers running MongoDB v3.0.12 already had spikes of up to 6k connections without any ExceededTimeLimit exception. Please see our TCP TIME_WAIT screenshot "tcp-tw_v3.0.12.jpg" of a mongos router running v3.0.12 which handles even more connections than the blocking router running v3.2.12, for which we posted the screenshot "fr-11_tcpwaitOnly.jpg" yesterday.

Right now, since the new MongoDB v3.2.12 cluster is not operational, we moved our client application back to the old cluster, which does not show any of the blocking behaviour seen on the new v3.2.12 cluster.

As you can see in "Steps to reproduce" above, the query which may block by throwing ExceededTimeLimit errors in v3.2.12 is querying the _id field, which is also the shardkey. We can reproduce the issue by using another indexed, unique field, called "offerId".

We strongly believe this issue is tied to MongoDB routers running v3.2.x, because we run 6 different MongoDB clusters on over 100 bare-metal machines, all with the same workload, but only the clusters using v3.2 mongos routers encounter ExceededTimeLimit errors. This makes it impossible for us to use MongoDB v3.2 in production.
Thanks for your investigations to fix this bug!

Comment by Mira Carey [ 05/Apr/17 ]

kay.agahd@idealo.de,

I'm taking another look at the logic around how we handle ExceededTimeLimit calls, but in the meantime I have some observations, and some setParameters I'd like you to try if you can.

Observations:

Load seems high

You're seeing a lot of new clients in a small amount of time. At the point at which ExceededTimeLimit starts to dominate the logs (15:36), we're at 994 connections open to the mongos. As recently as 15:17, 20 minutes prior, the number of clients was stable at less than 100. There's also a lot of churn going on, with 1271 new clients in the minute of 15:33 (note that this is prior to exceeded time limit showing up substantially in the logs) and 923 retiring ones. This speaks to a workload more complex than one connection making a sequence of queries.

That load is generating a lot of outbound connections

All of those clients are triggering a lot of new connections to various mongods. Going back to 15:33 (before we began to see connection timeouts), we see as many as 210 new connections to the 'mongo-066' mongod in a single minute. That isn't long enough for any of those connections to exit the pool via regular expiration. It also lines up with the 350 net new clients we gained in that minute (assuming poor connection re-use, perhaps because the burst in traffic was overloading the various mongods and preventing them from returning quickly).

All of the ExceededTimeLimit calls you see happen on one particular TaskExecutor

Despite work being well distributed across the various executors early in the logs, by 15:36 the bulk of the connecting, and failing to connect, occurs in 'NetworkInterfaceASIO-TaskExecutorPool-0-0'. The fact that the other pools don't see similar log messages probably points to something of a death spiral. In this case, I don't believe it's the same behavior as in SERVER-27232, but rather an unfortunate artifact of retrying connection failures.

The way connection timeouts work goes something like this:

  1. The user tries to run an operation with a 1-minute timeout.
  2. The user requests a connection from the connection pool.
  3. The connection pool tries to connect (because a request exists that it cannot satisfy), with a default timeout of 20 seconds:
    1. The connection attempt fails.
    2. Go back to step 3, because we still have an outstanding request for a connection.
  4. The user's request eventually times out, after 3 connection attempts have failed (20 + 20 + 20 seconds).

That can clearly exacerbate an already bad problem, as one pool lacks enough connections to satisfy its requests and keeps loading the server with new connections to accept, from both old and new clients. The problem is acute and comes in waves, because a set of connections that time out around the same time will generate new connection attempts at the same time.

Some questions

  • Is that client spike expected? I'm not familiar enough with your workload to know whether you would expect active clients to multiply by 10x between 15:17 and 15:36.
  • How many of those operations did you expect to be long-running? I.e., what % of the overall workload is indexed point queries?

If the spike was unexpected, you may be seeing a situation where the connection pools in your client application are too large. Things normally hum along fine, but as soon as a network blip comes in, or a mongod becomes unresponsive for a moment (maybe a step down), many new connections that previously weren't active are suddenly established (as most requests could otherwise be served serially by the same smaller pool of connections).

Some suggestions

Limit the number of mongos -> mongod connections

If the problem is that you're introducing many new clients, which are creating many connections to various mongods, that in turn load those machines, then you might benefit from setting ShardingTaskExecutorPoolMaxSize. It constrains the maximum number of connections in any given pool in mongos to any given host. This is viable only if you know that most operations can be expected to be short lived (as setting this to a low number could allow long running queries to eventually crowd out and starve short running operations).

It interacts with TaskExecutorPoolSize (the number of connection pools to use for a given mongos), which defaults to the number of cores clamped to the range 4-64. It appears that your mongos instances run on 4-core boxes, so you probably don't need to tune it, but it affects the math for choosing the pool max size.

Using something like lsof to determine the number of stable connections from a particular mongos to any given mongod, you could divide that number by 4 (to find the cap per pool), add a bit of a fudge factor (perhaps 2-3x), and see if limiting connection churn fixes your problem (see the sketch below). That should insulate you from having networking/perf glitches turn into connection storms.

Effectively, this solution involves using mongos to perform soft admission control / queueing rather than attempting to satisfy many requests in parallel.
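
A rough sketch of that sizing exercise, assuming lsof is available on the router host; the process lookup, the mongod hostname pattern and the fudge factor below are placeholders to adapt:

# Count established outbound connections from this mongos to one mongod
# ('mongo-066' is a placeholder hostname pattern; adjust as needed).
MONGOS_PID=$(pgrep -o mongos)
CONNS=$(lsof -P -iTCP -sTCP:ESTABLISHED -a -p "$MONGOS_PID" | grep -c 'mongo-066')
# With 4 connection pools (as on a 4-core mongos), the per-pool share is
# CONNS/4; add a 2-3x fudge factor before picking ShardingTaskExecutorPoolMaxSize.
echo "per-pool baseline: $((CONNS / 4)), candidate max size: $((CONNS / 4 * 3))"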

Make the connection timeout longer

If your system is capable of re-stabilizing from a high number of active clients on mongos (and many active connections on the mongod), you may be able to simply raise the timeout on connect. By changing ShardingTaskExecutorRefreshTimeoutMS from its default of 20000 to something more generous, like 60000, you may be able to weather these bursts of traffic. This works if your problem is bursts of traffic followed by relatively idle periods, and if the queries you're running have timeouts >20 seconds today.

Comment by Kay Agahd [ 05/Apr/17 ]

jonhyman, just for info: in contrast to SERVER-28232, our config servers are CSRS and not SCCC. However, my earlier ticket SERVER-26722 described the issue with SCCC config servers.

Comment by Kay Agahd [ 05/Apr/17 ]

Thanks ramon.fernandez for your fast reply. I've uploaded the first 70000 lines of the router log file as router-11.log.1.tgz to your upload portal. They contain 736 ExceededTimeLimit error lines. Let me know if you need further logs.

Comment by Jon Hyman [ 05/Apr/17 ]

This is likely related to our ticket at SERVER-28232.

Comment by Ramon Fernandez Marina [ 05/Apr/17 ]

Sorry your cluster is having issues kay.agahd@idealo.de, I've created a secure upload portal for you to upload the logs. I'm thinking that a section of the logs from a mongos startup until the first few ExceededTimeLimit messages appear should be a good start, but feel free to upload more if you want.

Thanks,
Ramón.

Comment by Kay Agahd [ 04/Apr/17 ]

The log file of the router is already over 600 MB. If you need it, where can we upload it?
