Core Server / SERVER-26722

router blocks and throws ExceededTimeLimit


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.10
    • Fix Version/s: None
    • Component/s: Querying, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Sharding 2016-11-21, Sharding 2016-12-12

      Description

      We have upgraded one of our sharded clusters from v3.0.12 to v3.2.10. Since then the cluster has not been operational, because many operations get blocked by the router. The corresponding log messages look like this:

      2016-10-20T11:00:22.902+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to s559:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.918+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.920+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      

      We can reproduce the issue at any time simply by executing a findOne through the router repeatedly:

      for (x = 0; x < 1000; x++) { db.offer.find({"_id": NumberLong("5672494983")}).forEach(function(u) { printjson(u) }); print(x) }
      

      It blocks after only a few findOnes.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.

      We found that the v3.0.12 mongos router does not have this problem. This is why we downgraded all our routers to v3.0.12, even though the rest of the cluster (the mongod processes) is running v3.2.10.

      Please see the attached log file from the router.
      Please see also the three monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (TCP TIME_WAIT) is much higher for v3.2.10 than for v3.0.12.
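      For reference, the TIME_WAIT count shown in those screenshots can be spot-checked on the router host. This is a generic Linux sketch (not the monitoring tool that produced the screenshots); it reads /proc/net/tcp directly, so it does not depend on ss or netstat being installed:

      ```shell
      #!/bin/sh
      # Count TCP sockets in TIME_WAIT by reading /proc/net/tcp{,6} directly.
      # Column 4 ("st") holds the socket state in hex; 06 is TIME_WAIT.
      count=$(awk 'FNR > 1 && $4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l)
      echo "TIME_WAIT sockets: $count"
      ```

      Running this in a loop while executing the findOne repro above should show the TIME_WAIT count climbing on a v3.2.10 router, matching the screenshots.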

        Attachments

        1. fr-11_tcpwait.jpg (173 kB)
        2. fr-11_tcpwaitOnly.jpg (169 kB)
        3. offerstore-en-router-03_afterDowngradeToV3.0.10.png (201 kB)
        4. offerstore-en-router-03_afterUpgradeToV3.2.10.png (201 kB)
        5. offerstore-en-router-03_beforeUpgrade.png (202 kB)
        6. offerstore-en-router-03_testWithV3.2.10.png (232 kB)
        7. offerstore-en-router-03.ipx.2016-10-20.log.tgz (9.46 MB)
        8. offerstore-en-router-03.ipx.2016-10-26.log.tgz (3.74 MB)
        9. v3.2.11 - 201.zip (102.86 MB)
        10. v3.2.11 - 202.zip (16.96 MB)
        11. v3.2.11 - 203.zip (34.03 MB)

          People

          • Votes: 2
          • Watchers: 17
