[SERVER-23562] Cluster hangs for a moment when using 'aaa', which is not in the cluster Created: 06/Apr/16  Updated: 14/Jun/16  Resolved: 14/Jun/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: ufo ofu Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File QQ图片20160408120437.png    
Issue Links:
Duplicate
duplicates SERVER-17917 Config DB gets updated even though th... Closed
Sprint: Sharding 14 (05/13/16), Sharding 15 (06/03/16), Sharding 16 (06/24/16)
Participants:

 Description   

The cluster hangs for a moment when switching to database 'aaa', which does not exist in the cluster.

All connections are reconnected at that moment.
log file:

2015-12-11T23:59:07.379+0800 I NETWORK [initandlisten] connection accepted from #152741 (103 connections now open)
2015-12-11T23:59:07.502+0800 I NETWORK [conn234578] end connection 192.168.20.113:35936 (81 connections now open)
2015-12-11T23:59:07.503+0800 I NETWORK [initandlisten] connection accepted from 192.168.20.113:35940 #234582 (82 connections now open)
2015-12-11T23:59:08.081+0800 I SHARDING [conn15614] couldn't find database [aaa] in config db
2015-12-11T23:59:08.384+0800 I NETWORK [conn241020] end connection (1418 connections now open)
2015-12-11T23:59:08.385+0800 I NETWORK [initandlisten] connection accepted from #241025 (1419 connections now open)
2015-12-11T23:59:08.755+0800 I NETWORK [conn132274] end connection (901 connections now open)
2015-12-11T23:59:08.755+0800 I NETWORK [initandlisten] connection accepted from #132278 (902 connections now open)

environment
linux: CentOS 6.5
mongo version: 3.0.7
filesystem: xfs
mongod storage engine: wiredTiger
config server storage engine: mmapv1



 Comments   
Comment by Kaloian Manassiev [ 14/Jun/16 ]

Hi liang.zhang@17zuoye.com,

The bug in question (SERVER-17917) is not going to be backported to 3.0 because of its complexity and the risk of destabilizing the release. Since it has been fixed in 3.2, if this is a critical issue for you, I recommend that you upgrade to the latest release (3.2.7 at the time of this writing) at your earliest convenience.

Best regards,
-Kal.

Comment by Kaloian Manassiev [ 31/May/16 ]

Thank you, liang.zhang@17zuoye.com, for providing the set of indexes. I do not see anything wrong with these indexes, so the most likely cause for the stalls when trying to use a non-existent database remains SERVER-17917.

Like I mentioned, this bug was fixed in version 3.2 but is still present in 3.0 and earlier versions. Would it be possible for you to attempt to reproduce it with the latest 3.2 release and let me know if this helps?

Best regards,
-Kal.

Comment by ufo ofu [ 26/May/16 ]

This behavior occurs in this cluster:

Five shards in total
Each shard has 1 primary, 2 secondaries, and 2 arbiters
Each shard has about 800 GB of storage (snappy compression)

mongos> use config
mongos> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.lockpings", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.locks", "name" : "_id_" }
{ "v" : 1, "key" : { "ping" : 1 }, "ns" : "config.lockpings", "name" : "ping_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.changelog", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.version", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.settings", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.chunks", "name" : "_id_" }
{ "v" : 1, "key" : { "ns" : 1, "min" : 1 }, "unique" : true, "ns" : "config.chunks", "name" : "ns_1_min_1" }
{ "v" : 1, "key" : { "ns" : 1, "shard" : 1, "min" : 1 }, "unique" : true, "ns" : "config.chunks", "name" : "ns_1_shard_1_min_1" }
{ "v" : 1, "key" : { "ns" : 1, "lastmod" : 1 }, "unique" : true, "ns" : "config.chunks", "name" : "ns_1_lastmod_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.shards", "name" : "_id_" }
{ "v" : 1, "key" : { "host" : 1 }, "unique" : true, "ns" : "config.shards", "name" : "host_1" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.mongos", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.databases", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.collections", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "config.tags", "name" : "_id_" }
{ "v" : 1, "key" : { "ns" : 1, "min" : 1 }, "unique" : true, "ns" : "config.tags", "name" : "ns_1_min_1" }
{ "v" : 1, "key" : { "state" : 1, "process" : 1 }, "ns" : "config.locks", "name" : "state_1_process_1" }
{ "v" : 1, "key" : { "ts" : 1 }, "name" : "ts_1", "ns" : "config.locks" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.actionlog" }

In another cluster this behavior does not occur:
Two shards in total
Each shard has 1 primary, 1 secondary, and 1 arbiter
Each shard has about 100 GB of storage (snappy compression)

mongos> use config
mongos> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.changelog" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.actionlog" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.locks" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.chunks" }
{ "v" : 1, "key" : { "ts" : 1 }, "name" : "ts_1", "ns" : "config.locks" }
{ "v" : 1, "key" : { "state" : 1, "process" : 1 }, "name" : "state_1_process_1", "ns" : "config.locks" }
{ "v" : 1, "unique" : true, "key" : { "ns" : 1, "min" : 1 }, "name" : "ns_1_min_1", "ns" : "config.chunks" }
{ "v" : 1, "unique" : true, "key" : { "ns" : 1, "shard" : 1, "min" : 1 }, "name" : "ns_1_shard_1_min_1", "ns" : "config.chunks" }
{ "v" : 1, "unique" : true, "key" : { "ns" : 1, "lastmod" : 1 }, "name" : "ns_1_lastmod_1", "ns" : "config.chunks" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.collections" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.lockpings" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.databases" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.mongos" }
{ "v" : 1, "key" : { "ping" : 1 }, "name" : "ping_1", "ns" : "config.lockpings" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.shards" }
{ "v" : 1, "unique" : true, "key" : { "host" : 1 }, "name" : "host_1", "ns" : "config.shards" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.tags" }
{ "v" : 1, "unique" : true, "key" : { "ns" : 1, "min" : 1 }, "name" : "ns_1_min_1", "ns" : "config.tags" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.settings" }
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "config.version" }

Comment by Kaloian Manassiev [ 23/May/16 ]

Thanks for the prompt response.

The metadata operations I am referring to happen against the config.databases collection. How big is it in each of the two clusters you listed above? Also, this collection should have an index on the database name. Can you please run db.getIndexes() and paste the output?
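
For example, something like the following against a mongos in each cluster would provide what I am looking for (a sketch only; the exact output will differ per cluster):

mongos> use config
mongos> db.databases.count()
mongos> db.databases.stats().size
mongos> db.databases.getIndexes()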

Thanks in advance.

-Kal.

Comment by ufo ofu [ 23/May/16 ]

Thank you for your answer.
This behavior occurs in this cluster:
mongos> db.collections.count()
5818
mongos> db.collections.find({ "dropped" : false }).count()
3936
mongos> db.chunks.count()
237463

In another cluster this behavior does not occur:
mongos> db.collections.count()
569
mongos> db.collections.find({ "dropped" : false }).count()
222
mongos> db.chunks.count()
13322

Could it be because of the amount of data in the config database?
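
(For reference, a rough sketch of how the size of the config metadata can be checked from a mongos; output will of course differ per cluster:)

mongos> use config
mongos> db.chunks.stats().size
mongos> db.collections.stats().size
mongos> db.databases.count()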

Comment by Kaloian Manassiev [ 23/May/16 ]

Hi liang.zhang@17zuoye.com,

Apologies for the delayed response. I tried to reproduce the problem you are experiencing by accessing an unknown database, but I was unable to get any existing shard connections to be reset.

However, one thing I noticed is related to SERVER-17917. Because of this bug, each attempt to read an unknown database will try to find and load that database's metadata from the config server and, if it does not exist there, will create it. This is done through a write to the config server. That write holds a mongos-wide mutex, and if it takes longer because of config server network round trips, it will stall any concurrent workload running against the same mongos instance.
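
As a rough illustration of the sequence involved (a sketch only, using the placeholder database name 'aaa' from your logs):

mongos> use aaa                                  (any database name that does not exist in the cluster)
mongos> show collections                         (triggers the metadata lookup described above)
mongos> use config
mongos> db.databases.find({ "_id" : "aaa" })     (on 3.0 an entry may appear here as a side effect)

While that metadata write to the config server is in flight, other operations running against the same mongos can appear to stall.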

My suspicion is that you are experiencing the side effects of SERVER-17917 and not reconnection problems.

This bug was fixed in version 3.2, but it is still present in 3.0 and earlier versions. Would it be possible for you to attempt to reproduce it with the latest 3.2 release and let me know if it still occurs?

Best regards,
-Kal.

Comment by ufo ofu [ 08/Apr/16 ]

environment

linux: CentOS 6.5
mongod version: 3.0.8
mongos version: 3.0.8
config server version: 3.0.8
filesystem: xfs
mongod storage engine: wiredTiger
config server storage engine: mmapv1

Commands run:

use aaa              (aaa does not exist in the cluster)
show tables          (this command caused the hang)

"Cluster hung" means that all connections reconnect at that moment. The clients' slow operations are mainly due to these reconnects.

mongos log

2016-04-05T18:11:33.486+0800 I SHARDING [conn350462] couldn't find database [aaa] in config db
2016-04-05T18:11:33.538+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30705 #350466 (1118 connections now open)
2016-04-05T18:11:33.589+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30706 #350467 (1119 connections now open)
2016-04-05T18:11:33.613+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:49538 #350468 (1120 connections now open)
2016-04-05T18:11:33.622+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48935 #350469 (1121 connections now open)
2016-04-05T18:11:33.625+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:49539 #350470 (1122 connections now open)
2016-04-05T18:11:33.632+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30707 #350471 (1123 connections now open)
2016-04-05T18:11:33.660+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:49542 #350472 (1124 connections now open)
2016-04-05T18:11:33.661+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:37624 #350473 (1125 connections now open)
2016-04-05T18:11:33.669+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48938 #350474 (1126 connections now open)
2016-04-05T18:11:33.672+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48939 #350475 (1127 connections now open)
2016-04-05T18:11:33.674+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48940 #350476 (1128 connections now open)
2016-04-05T18:11:33.676+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:49543 #350477 (1129 connections now open)
2016-04-05T18:11:33.678+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:37625 #350478 (1130 connections now open)
2016-04-05T18:11:33.681+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30710 #350479 (1131 connections now open)
2016-04-05T18:11:33.681+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:56589 #350480 (1132 connections now open)
2016-04-05T18:11:33.683+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:29306 #350481 (1133 connections now open)
2016-04-05T18:11:33.685+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30712 #350482 (1134 connections now open)
2016-04-05T18:11:33.685+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:55877 #350483 (1135 connections now open)
2016-04-05T18:11:33.689+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:62951 #350484 (1136 connections now open)
2016-04-05T18:11:33.691+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:54304 #350485 (1137 connections now open)
2016-04-05T18:11:33.692+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48941 #350486 (1138 connections now open)
2016-04-05T18:11:33.698+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:48942 #350487 (1139 connections now open)
2016-04-05T18:11:33.702+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:56591 #350488 (1140 connections now open)
2016-04-05T18:11:33.706+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:31129 #350489 (1141 connections now open)
2016-04-05T18:11:33.706+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:31130 #350490 (1142 connections now open)
2016-04-05T18:11:33.714+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:31131 #350491 (1143 connections now open)
2016-04-05T18:11:33.716+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:54305 #350492 (1144 connections now open)
2016-04-05T18:11:33.719+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:54306 #350493 (1145 connections now open)
2016-04-05T18:11:33.719+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:35992 #350494 (1146 connections now open)
2016-04-05T18:11:33.720+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:54307 #350495 (1147 connections now open)
2016-04-05T18:11:33.723+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:37628 #350496 (1148 connections now open)
2016-04-05T18:11:33.730+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:31132 #350497 (1149 connections now open)
2016-04-05T18:11:33.733+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:35993 #350498 (1150 connections now open)
2016-04-05T18:11:33.733+0800 I NETWORK  [mongosMain] connection accepted from x.x.x.x:30714 #350499 (1151 connections now open)

Comment by Kaloian Manassiev [ 06/Apr/16 ]

Hi liang.zhang@17zuoye.com,

Can you please provide the following information to help us better understand the problem you are encountering:

  • The versions of mongod and mongos you are using, and the complete log file from the mongos against which you executed the commands.
  • The exact commands you ran which caused the hang.

Also, can you elaborate on what "cluster hung" means? I.e., did the command time out, or did it just get stuck there indefinitely?

Best regards,
-Kal.
