[SERVER-38074] Abnormal performance degradation and large traffic between primary shard and config server Created: 10/Nov/18  Updated: 12/Nov/18  Resolved: 12/Nov/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.20
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adun Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

MongoDB version 3.2.20, cluster deployed as: 15 or more shard replica sets + 1 config server replica set.

Each shard has three nodes (1 primary + 2 secondaries); the storage engine is MMAPv1.

Abnormal performance degradation occurs irregularly:

1. The outbound traffic from all config nodes is very large.
2. The traffic only occurs between the primary node of the primary shard and the config servers.
3. When the situation occurs, we see a large number of requests from the primary shard to the config servers; logs as below:

log a:

2018-11-07T18:59:07.289959+08:00 [conn51755] remotely refreshing metadata for ${db}.${collection} with requested shard version 0|0||000000000000000000000000, current shard version is 23075|5302||5bd9add9f47203867a329afd, current metadata version is 23081|428||5bd9add9f47203867a329afd

log b:

2018-11-07T22:03:44.271591+08:00 [conn63943] updating metadata for ${db}.${collection} from shard version 23108|23||5bd9add9f47203867a329afd to shard version 23108|23||5bd9add9f47203867a329afd, took 850 ms

4. At this time, a large number of slow logs were generated on the primary node and operation times kept growing; db.currentOp() on the primary shard showed that many operations were `"msg" : "waiting for write concern"` (see the sketch below).
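
For reference, a minimal mongo shell sketch (not from the original report) that lists the operations stuck in this state; the field names follow the standard db.currentOp() output:

// Run on the primary of the primary shard: print active operations whose
// message matches the "waiting for write concern" state quoted above.
db.currentOp().inprog.forEach(function(op) {
    if (op.msg && op.msg.indexOf("waiting for write concern") === 0) {
        printjson({ opid: op.opid, ns: op.ns, secs_running: op.secs_running });
    }
});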

5. After we raised the log level, we found a large number of find queries for the chunks collection from the primary node to the config servers (see the sketch after the log excerpt below).

2018-11-09T19:03:00.370867+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Initiating asynchronous command: RemoteCommand 608607 -- target:${config svr ip}:${port} db:config expDate:2018-11-09T19:03:30.370+0800 cmd:{ find: "chunks", filter: { ns: "${db}.${collection}", lastmod: { $gte: Timestamp 0|0 } }, sort: { lastmod: 1 }, readConcern: { level: "majority", afterOpTime: { ts: ... } }, maxTimeMS: 30000 }
2018-11-09T19:03:00.370883+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 608607 on host ${config server ip}:${port}
2018-11-09T19:03:00.372395+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Request 608607 finished with response: { waitedMS: 0, cursor: { firstBatch: [ ... ] }, ok: 1.0 }
2018-11-09T19:03:00.373470+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Failed to time operation 608607 out: Operation aborted.
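
For reference, a hedged shell sketch of how the extra logging can be enabled and how a similar chunks query can be issued manually; the namespace placeholder matches the logs above, and the verbosity level is an assumption:

// On the primary of the primary shard: raise sharding log verbosity
// (assumption: level 2 is enough to show the ShardRegistry find commands).
db.setLogLevel(2, "sharding")

// Against the config server replica set: roughly the same config.chunks query the
// ShardRegistry sends during a metadata refresh ("${db}.${collection}" is a placeholder).
db.getSiblingDB("config").chunks.find({ ns: "${db}.${collection}", lastmod: { $gte: Timestamp(0, 0) } }).sort({ lastmod: 1 }).count()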

6. This situation can be worked around temporarily by stepping down the config server primary or the primary of the primary shard (see the sketch after this list), but it happens again at an unpredictable time in the future.
7. After stepping down the primary shard, the slow logs (UPDATE) on the old primary node kept growing; we still had to restart the mongod on the old primary node to stop them.
8. We have no way to reproduce it, but it keeps happening.
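
For completeness, a minimal sketch of the temporary workaround from point 6; the 60-second step-down window is an assumed value, not one taken from the report:

// Connect to the current primary of the config server replica set
// (or of the primary shard) and ask it to step down for 60 seconds.
rs.stepDown(60)
// Per point 7, if the old shard primary keeps writing slow UPDATE logs afterwards,
// that mongod still had to be restarted to stop them.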



 Comments   
Comment by Adun [ 12/Nov/18 ]

Config servers use WiredTiger as the storage engine.

Shard servers use MMAPv1 as the storage engine.

It occurs irregularly. The cluster is available most of the time.
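
A minimal shell sketch (not part of the original comment) for confirming the storage engine on each node:

// Run against each config server and shard member; prints "wiredTiger" or "mmapv1".
db.serverStatus().storageEngine.name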

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Adun,

You mention in your initial description that you have config servers deployed in a replica set but you are using MMAPv1 as a storage engine. Please note that when deployed in a replica set, config servers must be using the WiredTiger storage engine.

Your current issue may be the result of config server unavailability. Config servers must be available at all times in order to ensure the normal operation of a sharded cluster.

For further MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

Thank you,

Danny
