Core Server / SERVER-38074

Abnormal performance degradation and heavy traffic between primary shard and config servers

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.20
    • Component/s: None
    • Labels: None
    • ALL

      MongoDB version 3.2.20, cluster deployed as 15 or more shard replica sets + 1 config server replica set.

      Each shard has three nodes (1 primary + 2 secondaries); the storage engine is MMAPv1.

      Abnormal performance degradation occurs irregularly:

      1. Outbound traffic from all config nodes is very large.
      2. The traffic occurs only between the primary node of the primary shard and the config servers.
      3. When the situation occurs, we see a large number of requests from the primary shard to the config servers; logs are shown below:

      log a:

      2018-11-07T18:59:07.289959+08:00 [conn51755] remotely refreshing metadata for ${db}.${collection} with requested shard version 0|0||000000000000000000000000, current shard version is 23075|5302||5bd9add9f47203867a329afd, current metadata version is 23081|428||5bd9add9f47203867a329afd

      log b:

      2018-11-07T22:03:44.271591+08:00 [conn63943] updating metadata for ${db}.${collection} from shard version 23108|23||5bd9add9f47203867a329afd to shard version 23108|23||5bd9add9f47203867a329afd, took 850 ms

      4. At this time, many slow-query log entries were generated on the primary node and operation times kept growing; db.currentOp() on the primary shard showed that many queries were stuck with `"msg" : "waiting for write concern"`.

      5. After we raised the log level, we found a large number of find queries for the chunks collection issued from the primary node to the config servers:

      2018-11-09T19:03:00.370867+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Initiating asynchronous command: RemoteCommand 608607 – target:${config svr ip}:{port} db:config expDate:2018-11-09T19:03:30.370+0800 cmd:{ find: "chunks", filter: { ns: "${db}.${collection}", lastmod: { $gte: Timestamp 0|0 } }, sort: { lastmod: 1 }, readConcern: { level: "majority", afterOpTime: { ts: ... } }, maxTimeMS: 30000 }
      2018-11-09T19:03:00.370883+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 608607 on host ${config server ip}:${port}
      2018-11-09T19:03:00.372395+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Request 608607 finished with response: { waitedMS: 0, cursor: { firstBatch: ... }, ok: 1.0 }
      2018-11-09T19:03:00.373470+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Failed to time operation 608607 out: Operation aborted.

      6. The situation can be resolved temporarily by stepping down the config server primary or the primary shard's primary, but it recurs at an unpredictable time.
      7. After stepping down the primary shard, the slow logs (UPDATE) on the old primary node kept growing; we had to restart the mongod on the old primary node to stop it.
      8. We have no way to reproduce the issue on demand, but it keeps occurring.
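For reference, the metadata-refresh find in step 5 can be sketched as plain JavaScript. This is a minimal model, not MongoDB internals: the namespace and chunk documents are made-up placeholders, and timestamps are modeled as `{ t, i }` pairs rather than MongoDB's real Timestamp type. It only illustrates the shape of the query the shard sends to the config server: filter by namespace and `lastmod >= Timestamp 0|0`, sorted ascending by `lastmod`.

```javascript
// Compare two { t, i } timestamp pairs: a >= b (major term first, then increment).
// NOTE: a simplified stand-in for MongoDB's Timestamp comparison, for illustration only.
const tsGte = (a, b) => a.t > b.t || (a.t === b.t && a.i >= b.i);

// Query shape reconstructed from the log line above.
// "testdb.testcoll" is a placeholder for ${db}.${collection}.
const query = {
  ns: "testdb.testcoll",
  lastmodGte: { t: 0, i: 0 }, // Timestamp 0|0 => asks for every chunk (full refresh)
};

// Hypothetical chunk metadata, as it might sit in config.chunks.
const chunks = [
  { ns: "testdb.testcoll", lastmod: { t: 23108, i: 23 } },
  { ns: "testdb.testcoll", lastmod: { t: 23075, i: 5302 } },
  { ns: "otherdb.other",   lastmod: { t: 1, i: 0 } },
];

// Apply the filter and sort, mimicking what the config server returns.
const result = chunks
  .filter(c => c.ns === query.ns && tsGte(c.lastmod, query.lastmodGte))
  .sort((a, b) => (a.lastmod.t - b.lastmod.t) || (a.lastmod.i - b.lastmod.i));

console.log(result.map(c => `${c.lastmod.t}|${c.lastmod.i}`));
// → [ '23075|5302', '23108|23' ]
```

Note that log a above shows a requested shard version of 0|0||000000000000000000000000, i.e. the `lastmod: { $gte: Timestamp 0|0 }` case: each such refresh pulls the full chunk list for the collection, which would be consistent with the heavy config-server traffic observed.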

            Assignee:
            daniel.hatcher@mongodb.com Danny Hatcher (Inactive)
            Reporter:
            stutiredboy@gmail.com Adun
            Votes:
            0
            Watchers:
            7
