[SERVER-40130] Improve multi-threading Created: 14/Mar/19  Updated: 16/May/19  Resolved: 20/Mar/19

Status: Closed
Project: Core Server
Component/s: Performance, Storage
Affects Version/s: 3.4.17
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Pichardie kévin Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File iostat.log.gz     File metrics.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-25025 Improve startup time when there are t... Closed
Participants:

 Description   

Hello, at Sendinblue we have been using MongoDB for a long time and have a few clusters running with big datasets.

We are currently struggling with a cluster using WiredTiger: slow startup, and replication that either cannot catch up or takes a very long time.

Here is some information about the sizing of the cluster:

  • MongoDB version: 3.4.17
  • Shards: 10
  • Databases per shard: ~26K
  • Collections per shard: ~600K
  • Shard size: ~500 GB
  • Files in the data directory for the tested shard: 1,873,414 (find /data/ | wc -l; see the sketch below)
    • FYI, we did not split journal, data, and indexes into separate directories.
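
As a rough breakdown of that file count (an illustrative sketch only; paths assume the default layout under /data/ shown in the configuration further below):

```
# Count WiredTiger table files (collections and indexes) versus journal files
find /data/ -maxdepth 1 -name '*.wt' | wc -l
find /data/journal -type f | wc -l
```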

I have isolated one shard of the cluster to debug the bottleneck we are hitting. I am troubleshooting on a secondary running on a Google Cloud instance; it appears to freeze a lot while starting up, replaying the oplog, etc.

We identified that the shard seems to have some work running single-threaded or not efficiently multi-threaded. The instance has 16 vCPUs at 2.5 GHz and 96 GB of memory.

 

Starting the mongod instance takes a very long time, and server statistics show that only one or two vCPUs are actually doing any work.

We found some related information here: https://jira.mongodb.org/browse/SERVER-27700?focusedCommentId=1480933&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1480933

But since we already run a version that includes those improvements, we shouldn't be struggling with replication ops or evictions.

Here is the current configuration in use, as reported in the startup log:

```
STORAGE [initandlisten] wiredtiger_open config: create,cache_size=40960M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),verbose=(recovery_progress)
```

 

We have tried some WiredTiger modifications based on the document below and the comment in the previous Jira link: https://source.wiredtiger.com/2.9.0/group__wt.html#gab435a7372679c74261cb62624d953300

Currently my configuration is:

```
net:
  bindIp: 0.0.0.0
  port: XXXXX
processManagement:
  pidFilePath: /var/run/mongodb/shard1.pid
replication:
  oplogSizeMB: 10240
  replSetName: XXXXX
setParameter:
  cursorTimeoutMillis: 1800000
  failIndexKeyTooLong: true
sharding:
  clusterRole: shardsvr
storage:
  dbPath: /data/
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 40
```
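
The extra file_manager settings that appear in the startup log further below are not in this YAML. One hedged way to experiment with them on a running 3.4 node is the wiredTigerEngineRuntimeConfig server parameter; the sketch below copies the values from that log line and reuses the port placeholder, and we have not validated that file_manager is accepted this way:

```
# Experimental: reconfigure WiredTiger's file_manager settings on the running node
# (values taken from the startup log below; adjust before use)
mongo --port XXXXX --eval 'db.adminCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "file_manager=(close_handle_minimum=10000,close_idle_time=3600,close_scan_interval=10)"})'
```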

Is there any setting that allows more parallelism at startup and during replication? It seems that some of this work is not multi-threaded.

```
shard1-:# ps -T -p 32107
PID SPID TTY TIME CMD
32107 32107 ? 00:05:27 shard
32107 32109 ? 00:00:00 signalP.gThread
32107 32110 ? 00:00:00 Backgro.kSource
32107 32216 ? 00:00:00 shard
32107 32217 ? 00:00:00 shard
32107 32218 ? 00:00:00 shard
32107 32219 ? 00:00:24 shard
32107 32220 ? 00:00:24 shard
32107 32221 ? 00:00:24 shard
32107 32222 ? 00:00:24 shard
32107 32223 ? 00:00:02 shard
32107 32224 ? 00:01:03 shard
32107 32225 ? 00:00:01 WTJourn.Flusher

``` 

During startup we clearly see that the server is stuck here, with one thread at 100% CPU and the rest doing almost nothing:

```
2019-03-14T17:38:24.611+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=40960M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),verbose=(recovery_progress),file_manager=(close_handle_minimum=10000,close_idle_time=3600,close_scan_interval=10)
2019-03-14T17:38:42.046+0000 I STORAGE [initandlisten] WiredTiger message [1552585122:46632][32107:0x7fd0dd295d40], txn-recover: Main recovery loop: starting at 401225/10635776
2019-03-14T17:38:42.047+0000 I STORAGE [initandlisten] WiredTiger message [1552585122:47626][32107:0x7fd0dd295d40], txn-recover: Recovering log 401225 through 401226
2019-03-14T17:38:42.588+0000 I STORAGE [initandlisten] WiredTiger message [1552585122:588204][32107:0x7fd0dd295d40], file:collection-75452--270635807257442042.wt, txn-recover: Recovering log 401226 through 401226

 

ls -alih /data/collection-75452-270635807257442042.wt
2274678025 -rw-r--r-- 1 mongodb mongodb 1.2M Mar 14 17:38 /data/collection-75452-270635807257442042.wt

free -mh
total used free shared buffers cached
Mem: 94G 61G 32G 32M 85M 41G
-/+ buffers/cache: 20G 74G
Swap: 24G 0B 24G

```
This step took 40 minutes, using only one CPU.
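
To confirm which thread is the hot one during this phase, something like the following can be run alongside the startup (illustrative only; pidof mongod assumes a single mongod on the host):

```
# Live per-thread CPU usage of the mongod process
top -H -p "$(pidof mongod)"

# Or log it once per second for later analysis
pidstat -t -p "$(pidof mongod)" 1 > pidstat.log &
```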

Can you help with this? I know you will need more info, which I can probably provide.

 

Thanks in advance


 Comments   
Comment by Pichardie kévin [ 21/Mar/19 ]

Hello Eric,

 

Yes, but that case hasn't been updated since 2016 :s. Should I post my updates on that case?

Could you also share your findings so we keep a trace of the bottleneck and can sum it up on SERVER-25025?

Best regards,

Kévin

Comment by Eric Sedor [ 20/Mar/19 ]

At this time I'm going to close this ticket as a duplicate. Can you please see SERVER-25025 for future updates?

Comment by Eric Sedor [ 20/Mar/19 ]

kpichardie, with these logs we've been able to rule out some other possibilities and currently believe that this is a case of SERVER-25025. However, it's likely exacerbated by there being 1.8 million files, versus tens of thousands.

Comment by Eric Sedor [ 20/Mar/19 ]

Thanks Kévin, this helps; we are taking a look and will let you know.

Comment by Pichardie kévin [ 20/Mar/19 ]

Hello Eric,

I have retried with a normal stop and the time is mostly the same. See the log attached on the secure portal.

Let me know if you need more info. 

Kévin

Comment by Pichardie kévin [ 20/Mar/19 ]

Hello Eric,

 

I stopped the process normally using systemd, but it also takes a long time to stop (around 5 minutes); I will check, but systemd probably killed it.

As the default timeout is 5 minutes, I do think mongod is being killed.

 

I will try increasing the timeout and check the restart time.
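
A minimal sketch of how I plan to raise the stop timeout, assuming the service unit is named mongod (unit name and value are illustrative):

```
# Drop-in override giving mongod more time to shut down cleanly before systemd sends SIGKILL
sudo mkdir -p /etc/systemd/system/mongod.service.d
printf '[Service]\nTimeoutStopSec=1800\n' | sudo tee /etc/systemd/system/mongod.service.d/timeout.conf
sudo systemctl daemon-reload
```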

 

Kévin

 

Comment by Eric Sedor [ 19/Mar/19 ]

Hi Kévin, the bulk of the time spent during restart appears to involve the work necessary to recover from an unclean shutdown. Can you clarify how you are stopping the node?

Comment by Pichardie kévin [ 19/Mar/19 ]

Hello Eric,

Please find attached the log file from the restart.

 

Best regards,

Kévin

Comment by Eric Sedor [ 19/Mar/19 ]

Of course Kévin; I've generated an uploader for you here

Comment by Pichardie kévin [ 19/Mar/19 ]

Hello Eric,

Can I have a link to the secure portal to upload the log information?

Sorry, I missed it in my comment, but it switched to SECONDARY just after STARTUP2:

2019-03-14T23:23:40.186+0000 I REPL [rsSync] transition to SECONDARY

I will try to do the restart with iostat running, but we don't see any limitation on that side.

I have uploaded the iostat output from my restart test; can you provide a secure link for the rest of the files?

Best regards,
Kévin

Comment by Eric Sedor [ 18/Mar/19 ]

kpichardie, we are tracking some known inefficiencies with high file counts in SERVER-25025 but would like to investigate more thoroughly to see if any other issues are in play. It looks like you may have snipped some log lines from the above comment, is that right? If so, can you please provide the complete logs between 2019-03-14T17:38:24 and a time when the node transitions to SECONDARY status?
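
One way to pull that window out of the log file, purely as a sketch (the log path is an assumption; adjust the end pattern to the line showing the SECONDARY transition):

```
# Extract log lines from the restart timestamp through the transition to SECONDARY
awk '/2019-03-14T17:38:24/,/transition to SECONDARY/' /var/log/mongodb/shard1.log > restart-window.log
```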

It may also be helpful if you can repeat the collection of diagnostic data during the secondary's restart while also running the following:

delay=1
iostat -k -t -x ${delay:?} >iostat.log &

This will allow us to get disk metrics for additional periods during the startup which are not available in the current diagnostic data.

Thanks in advance!

Comment by Pichardie kévin [ 18/Mar/19 ]

Hello, 

I have uploaded diagnostics from the tests I made last Thursday (the 14th) and over the weekend.

Basically the problem impacts replication (we frequently see lag) and startup, which is slow.

The problem is fairly constant on replication: we see several jumps in the delay, and one CPU is quite busy while the others are not overloaded.

During the 14th I was trying some custom WiredTiger configuration as explained previously, but with no success.

The last restart with the standard configuration parameters took a very long time:

2019-03-14T17:38:24.599+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2019-03-14T17:38:24.607+0000 I CONTROL [initandlisten] MongoDB starting : pid=32107 port=XXXX dbpath=/data/ 64-bit host=shard1-2
2019-03-14T17:38:24.607+0000 I CONTROL [initandlisten] db version v3.4.17
2019-03-14T17:38:24.607+0000 I CONTROL [initandlisten] git version: 7c14a47868643bb691a507a92fe25541f998eca4
2019-03-14T17:38:24.607+0000 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1t 3 May 2016
.....
2019-03-14T23:23:16.471+0000 W NETWORK [ReplicaSetMonitor-TaskExecutor-0] Failed to connect to XXXXXX:XXXXX, in(checking socket for error after poll), reason: Connection refused
2019-03-14T23:23:40.139+0000 I NETWORK [thread2] waiting for connections on port XXXX
2019-03-14T23:23:40.183+0000 I REPL [replExecDBWorker-0] New replica set config in use: { _id: "shard1", version: 88, protocolVersion: 1, members: [ { _id: 2, host: "shard1-arb-1:XXX", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 3, host: "shard1-arb-2:XXXX", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 8, host: "shard1-1:XXXXX", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 100.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 9, host: "shard1-1:XXXXX", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 50.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 10, host: "shard1-2:XXXXX", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 2000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('591440486e078e6c1d57d210') } }
2019-03-14T23:23:40.183+0000 I REPL [replExecDBWorker-0] This node is shard1-2:XXXX in the config
2019-03-14T23:23:40.183+0000 I REPL [replExecDBWorker-0] transition to STARTUP2
2019-03-14T23:23:40.183+0000 I REPL [replExecDBWorker-0] Starting replication storage threads
2019-03-14T23:23:40.184+0000 I REPL [replExecDBWorker-0] Starting replication fetcher thread
2019-03-14T23:23:40.184+0000 I REPL [replExecDBWorker-0] Starting replication applier thread
2019-03-14T23:23:40.184+0000 I REPL [replExecDBWorker-0] Starting replication reporter thread
2019-03-14T23:23:40.184+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to shard1-arb-1:XXXXX

 

I believe some of this work is single-threaded, causing slow processing during startup and replication, as one CPU is saturated.

 

Comment by Eric Sedor [ 14/Mar/19 ]

Hello kpichardie and thanks for the information so far. To best help us understand the behavior you're observing, can you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) for all nodes during a representative incident and attach it to this ticket?

Timestamps for said incidents will help us target our examination of this data.
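
A minimal sketch of one way to do this, assuming dbPath is /data/ as in the configuration above (archive name is illustrative):

```
# Archive this node's FTDC diagnostic data for upload
tar -czf diagnostic.data.$(hostname).tar.gz -C /data/ diagnostic.data
```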
