[SERVER-75052] 5.0 shard replica startup takes much longer than 4.x due to WT checkpoint Created: 20/Mar/23  Updated: 08/Sep/23  Resolved: 08/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.15
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Springer Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: regression, slow, startup
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Run a 5.0 sharded cluster with many databases and collections, and heavy read and write volume.

Participants:

 Description   

In 4.x, our shard replicas consistently took around 2 minutes to start up. In 5.0, they are taking anywhere from 3.5 minutes to 12 minutes to start up. The additional startup time is primarily consumed by a WT checkpoint. It appears that 5.0 performs a checkpoint as part of the startup process, whereas 4.x waited until after startup was complete to take the first checkpoint.

 

Here's an abridged example of a 4.2 startup:

2022-11-22T02:45:20.979+0000 I  CONTROL  [main] ***** SERVER RESTARTED *****
2022-11-22T02:47:17.352+0000 I  REPL     [rsBackgroundSync] Changed sync source from empty to shard1repl3.example.com:27017
// TOTAL: 1m57s
//...
2022-11-22T02:49:07.362+0000 I  STORAGE  [WTCheckpointThread] WiredTiger message [1669085347:362181][77440:0x7f956c557700], file:enelglobalservicessrl3/collection-11873-6569341069224170235.wt, WT_SESSION.checkpoint: Checkpoint has been running for 20 seconds and wrote: 65000 pages (3797 MB)
2022-11-22T02:49:11.100+0000 I  STORAGE  [WTCheckpointThread] WiredTiger message [1669085351:100075][77440:0x7f956c557700], WT_SESSION.checkpoint: Checkpoint ran for 23 seconds and wrote: 83927 pages (5232 MB)
// CHECKPOINT: 23s

 

Notice that the startup takes less than 2 minutes, and the first checkpoint runs a couple of minutes later.

 

And here's an abridged example of a 5.0 startup on the exact same machine:

 {"t":{"$date":"2023-03-20T13:33:45.171+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"-","msg":"***** SERVER RESTARTED *****"}
// ...
{"t":{"$date":"2023-03-20T13:34:53.896+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1679319293:896857][116638:0x7f46be029b80], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 1, snapshot max: 1 snapshot count: 0, oldest timestamp: (1679318801, 3202) , meta checkpoint timestamp: (1679319101, 3202) base write gen: 1070042338"}}
{"t":{"$date":"2023-03-20T13:41:37.518+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":"[1679319697:518964][116638:0x7f46adbe4700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 10, snapshot max: 10 snapshot count: 0, oldest timestamp: (1679318801, 3202) , meta checkpoint timestamp: (1679319101, 3202) base write gen: 1070042338"}}
// CHECKPOINT: 7m12s
// ...
{"t":{"$date":"2023-03-20T13:42:04.502+00:00"},"s":"I",  "c":"REPL",     "id":21088,   "ctx":"BackgroundSync","msg":"Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"shard1repl3.example.com:27017"}}
// STARTUP: 8m19s

 

Notice that the startup takes over 8 minutes, more than 7 minutes of which is spent performing a checkpoint.

An 8-12 minute startup time results in a 10-14 minute restart time, which is unacceptable. It makes rolling restarts for upgrades and other maintenance take 5-7 times longer. Even more importantly, it prevents us from dealing expediently with cluster availability and performance issues.
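For reference, a minimal sketch of how the durations above could be computed from the 5.0 structured logs. It keys off the log ids visible in the excerpt (20698 for the restart banner, 21088 for the sync-source change, 22430 for checkpoint progress); treating the sync-source change as the end of startup is only a proxy:

#!/usr/bin/env python3
"""Rough sketch: estimate startup and checkpoint durations from a v5.0 mongod
structured (JSON) log. Keys off the log ids shown in the excerpt above
(20698 = restart banner, 21088 = sync-source change, 22430 = WT checkpoint
progress); the sync-source change is only a proxy for "startup complete"."""
import json
import sys
from datetime import datetime

def ts(entry):
    # Structured mongod logs store the timestamp as {"t": {"$date": "..."}}
    return datetime.fromisoformat(entry["t"]["$date"])

restarted = None
sync_source = None
checkpoints = []

with open(sys.argv[1]) as log:
    for line in log:
        try:
            e = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if e.get("id") == 20698:          # ***** SERVER RESTARTED *****
            restarted, sync_source, checkpoints = ts(e), None, []
        elif e.get("id") == 21088 and sync_source is None:
            sync_source = ts(e)           # Changed sync source
        elif e.get("id") == 22430 and e.get("ctx") in ("initandlisten", "Checkpointer"):
            checkpoints.append(ts(e))     # WT checkpoint progress messages

if restarted and sync_source:
    print("startup (restart -> sync source):", sync_source - restarted)
if len(checkpoints) >= 2:
    print("checkpoint progress span:        ", checkpoints[-1] - checkpoints[0])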

Is it viable to switch back to the 4.x behavior of waiting until startup is complete to perform the first WT checkpoint?

 



 Comments   
Comment by Yuan Fang [ 08/Sep/23 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Yuan Fang [ 15/Aug/23 ]

Hi ian.springer@salesforce.com, we still need additional information to diagnose the problem. If this is still an issue for you, would you please upload the FTDC and mongod logs that cover the incident of the lengthy startup to this new upload portal? Thank you. 

Comment by Yuan Fang [ 19/Jul/23 ]

Hi ian.springer@salesforce.com,

I apologize for the delay in getting back to you. I understand that you were unable to upload diagnostic data for version 4.2. If the long-startup issue persists in version 5.0, could you please upload the FTDC and mongod logs that cover such an incident? Since the previous upload portal has expired, I have created a new one for you. Thank you very much for your understanding and support.

Regards,
Yuan
 

Comment by Ian Springer [ 20/Mar/23 ]

Hi Yuan,

 

It is true that the checkpoints are taking longer on 5.0 (and the cluster's level of activity has increased by about 60% between the time of the 4.2 logs and the present). However, it appears the reason for the much longer mongod startup times in 5.0 is that the initial checkpoint runs as part of startup, versus post-startup in 4.x.

 

Regarding uploading logs: we would need to redact all customer data from them, which would be a huge task, since the logs contain actual queries as well as database and collection names, which include the names of our customers. Do you happen to have a utility for redacting all such information from Mongo logs that I could run before uploading them?
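For context, something like the following minimal sketch is what we would otherwise have to write ourselves for the 5.0 structured logs. The set of attr keys to scrub ("ns", "namespace", "command") is only a guess and would need to be verified against real log output:

#!/usr/bin/env python3
"""Minimal sketch of redacting customer data from mongod structured (JSON)
logs before upload. Assumption: the sensitive values (customer-named
databases/collections, query shapes) live under attr keys such as "ns",
"namespace", and "command"; that key list is illustrative, not verified."""
import hashlib
import json
import sys

SENSITIVE_KEYS = {"ns", "namespace", "command"}  # assumed; tune per workload

def redact(value):
    # Replace a sensitive value with a short stable hash so related log
    # lines remain correlatable without exposing the original value.
    digest = hashlib.sha256(
        json.dumps(value, sort_keys=True, default=str).encode()
    ).hexdigest()
    return f"<redacted:{digest[:12]}>"

def scrub(obj):
    if isinstance(obj, dict):
        return {k: (redact(v) if k in SENSITIVE_KEYS else scrub(v)) for k, v in obj.items()}
    if isinstance(obj, list):
        return [scrub(v) for v in obj]
    return obj

for line in sys.stdin:
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        sys.stdout.write(line)  # pass non-JSON lines through unchanged
        continue
    if "attr" in entry:
        entry["attr"] = scrub(entry["attr"])
    sys.stdout.write(json.dumps(entry) + "\n")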

 

It might be easier if you could tell me the specific log lines I could provide that would help you diagnose the issue. Can you do that?

 

Regarding diagnostic metrics, I can upload those for 5.0 if they do not contain any customer data (I will check on that). I will not be able to upload diagnostic metrics for 4.2, since we only retain diagnostic.data files for a week. I could potentially provide diagnostic metrics from another of our clusters that is still running 4.4, but that would be an apples-to-oranges comparison, since the clusters have different numbers of databases and collections, overall data sizes, and shard counts.

 

Thanks!

Ian

Comment by Yuan Fang [ 20/Mar/23 ]

Hi ian.springer@salesforce.com,

Thank you for your report. I understand that you observed longer-running checkpoints at a v5.0 startup compared to a v4.2 startup. I have observed similar issues in the past that were caused by changes in the workload. To determine the reason for the longer checkpoints in v5.0 and answer your question at the end, we require additional information.

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For the node in question, covering a time period that includes both incidents:

  • the v4.2 startup that took less than 2 minutes
  • the v5.0 startup that took over 8 minutes

would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Regards,
Yuan

 
