[SERVER-32418] Due to a lag, query is very slow in the config server Created: 20/Dec/17  Updated: 27/Oct/23  Resolved: 05/Mar/18

Status: Closed
Project: Core Server
Component/s: Performance, Sharding
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: liangzhang Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File config.log.rar     File config.log.tgz     File diagnostic.data.tgz     PNG File ftdc.png     PNG File lag.png     File log-20180223-16.rar    
Operating System: ALL
Participants:

 Description   

System environment
System: CentOS release 6.6
Kernel: 2.6.32-504.el6.x86_64
Memory: 256G
Disk: SSD-PCIE
File System: xfs
mongo version: 3.4.10

The config server replica set in my cluster has three members; one of the secondaries shows slow queries in its log file. After I executed sh.disableAutoSplit(), there were even more slow queries in that secondary's log, and the ops on all members rose to about 2k/s.
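
For reference, auto-splitting was toggled with the standard mongo shell helpers, and the setting itself lives in config.settings, which is exactly what the slow queries below are reading. A minimal sketch (the example output is illustrative):

    // run against a mongos
    sh.disableAutoSplit()    // writes { _id: "autosplit", enabled: false } to config.settings
    sh.enableAutoSplit()     // sets enabled back to true

    // inspect the stored setting that the slow queries below target
    db.getSiblingDB("config").settings.find({ _id: "autosplit" })
    // e.g. { "_id" : "autosplit", "enabled" : true }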

2017-12-19T20:14:12.749+0800 I COMMAND  [conn4490] command config.settings command: find { find: "settings", filter: { _id: "autosplit" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:391 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4997ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4585] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4997ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4602] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:392 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4989ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4583] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4989ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4580] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:392 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4994ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4545] command config.settings command: find { find: "settings", filter: { _id: "autosplit" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:391 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4997ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4600] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4963ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4451] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4988ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4572] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:392 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4976ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4387] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4908ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4578] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:392 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4974ms
2017-12-19T20:14:12.749+0800 I COMMAND  [conn4607] command config.settings command: find { find: "settings", filter: { _id: "autosplit" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1513685647000|1, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:391 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4990ms



 Comments   
Comment by Dmitry Agranat [ 04/Mar/18 ]

Hi ufozhangliang@163.com,

Thank you for providing the requested information.

As suspected, when there is replication lag, the reported query is slow on the config server. This is expected, as explained in my previous comment. Looking at two such occurrences:

  • The times presented in the attached graphs are in UTC
  • A and B mark the start of the slow queries, which correlate with the lag; the corresponding log entries are:

    2018-03-02T15:38:23.554+0800 I COMMAND  [conn405762] command config.settings command: find { find: "settings", filter: { _id: "balancer" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1519976298000|5, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:404 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4998ms
    2018-03-02T15:45:23.509+0800 I COMMAND  [conn405848] command config.settings command: find { find: "settings", filter: { _id: "chunksize" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1519976718000|4, t: 9 } }, limit: 1, maxTimeMS: 30000 } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:392 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 4998ms
    

I am going to close this ticket as "works as designed". However, I'd like to mention one abnormality I've noticed:

  • Server BJ-DB-107 is 3.2.13
  • Server ZW-10-0-1-76 is 3.4.10
  • Server BJ-DB-109 is 3.4.10

Mixing different major MongoDB versions in a replica set is not recommended (except for the duration of an upgrade). Fixing this misconfiguration may also resolve the periodic lag in your config server replica set.
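
For a quick confirmation, a minimal mongo shell check of the binary version on each member (a sketch, assuming direct shell access to every node; the values shown are illustrative):

    // run against each member of the replica set in turn
    db.version()                      // reports the server binary version, e.g. "3.2.13" vs "3.4.10"
    db.serverBuildInfo().version      // the same value, taken from the buildInfo document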

Thanks,
Dima

Comment by liangzhang [ 02/Mar/18 ]

The files have been uploaded.

Comment by Dmitry Agranat [ 28/Feb/18 ]

Hi ufozhangliang@163.com,

Thank you for providing the requested information.

I'd like to validate that there is no replication lag between your config servers. Please note that the read uses afterOpTime in combination with readConcern: majority, so the query will block, holding no locks, until a majority of members have reached the specified optime.

  • Please archive (tar or zip) the $dbpath/diagnostic.data directory from all the config servers covering the time when the query is very slow on the config server and upload it to this ticket.
  • A fresh set of config server logs where the query is very slow.
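
In the meantime, a minimal sketch for eyeballing replication lag from the shell on the config server replica set (assuming direct access to a member; output shown is illustrative):

    // connected to a config server replica set member
    rs.printSlaveReplicationInfo()       // prints how far each secondary is behind the primary

    // or compare member optimes directly
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
    });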

Thanks,
Dima

Comment by liangzhang [ 23/Feb/18 ]

Sorry for the delayed reply. I have uploaded the logs of all nodes. All nodes have this problem.

Comment by Kelsey Schubert [ 16/Feb/18 ]

Hi ufozhangliang@163.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please address the questions in Mark's most recent comment?

Thank you,
Kelsey

Comment by Mark Agarunov [ 23/Jan/18 ]

Hello ufozhangliang@163.com,

Thank you for the additional information and my apologies for the delay in replying. Unfortunately we have not yet been able to determine the cause of this behavior. If possible, could you provide the complete logs from all nodes in the cluster? Additionally, is this behavior causing adverse issues such as performance regressions? Do all the nodes experience the same issue or only this specific node?

Thanks,
Mark

Comment by liangzhang [ 27/Dec/17 ]

If autosplit is re-enabled, there are fewer slow queries than when it is disabled, but they still occur.
After re-enabling autosplit, the ops rate drops a lot, and fewer slow queries show up in the logs.

Comment by Mark Agarunov [ 26/Dec/17 ]

Hello ufozhangliang@163.com,

Thank you for providing this information. After looking over the logs and diagnostic data, it seems that the slow queries are correlated with saslStart and saslContinue commands.

This leads me to suspect that the slow queries may be due to slow responses from SASL authentication. Was this behavior present in any capacity before disabling auto-splitting, and does it continue if autosplit is re-enabled?
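
For reference, one way to pull the recent sasl-related lines straight from a running node (a minimal sketch; filtering the on-disk log with your usual tools works just as well):

    // fetch the in-memory global log and keep only sasl-related lines
    var res = db.adminCommand({ getLog: "global" });
    res.log.filter(function (line) {
        return /saslStart|saslContinue/.test(line);
    }).forEach(function (line) { print(line); });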

Thanks
Mark

Comment by Mark Agarunov [ 21/Dec/17 ]

Hello ufozhangliang@163.com,

Thank you for the report. To get a better idea of what may be causing this behavior, could you please provide the following:

  • The complete log files from all affected nodes
  • Please archive (tar or zip) the $dbpath/diagnostic.data directory and upload it to this ticket.

This should give some insight into this issue.

Thanks,
Mark

Comment by liangzhang [ 20/Dec/17 ]

The config server queries take about 2 seconds, and sometimes they block our client queries.
