[SERVER-58873] Compaction procedure for small collection takes a significant amount of time Created: 27/Jul/21  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Basil Markov Assignee: Backlog - Storage Engines Team
Resolution: Unresolved Votes: 0
Labels: refinement
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Storage Engines
Participants:
Story Points: 8

 Description   

Hi there, I have some questions about the compaction procedure timing.

Our prerequisites were:

1. We started compaction with our smallest collection, call it collectionX, with the following stats:

 

"size" : 9624660994, 
"count" : 22641357, 
"avgObjSize" : 425, 
"storageSize" : 13472161792, 
"freeStorageSize" : 10151288832

 

Also, we pre-checked the potential amount of disk space we could free:

db.collectionX.stats().wiredTiger["block-manager"]["file bytes available for reuse"] 9945591808
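
To also account for reclaimable space inside the index files (which compact rewrites as well), a quick pre-check along these lines can be used. This is just a sketch, assuming WiredTiger and the standard indexDetails option of collStats:

var s = db.collectionX.stats({indexDetails: true});
var reclaimable = s.wiredTiger["block-manager"]["file bytes available for reuse"];
Object.keys(s.indexDetails).forEach(function (idx) {
    // each index file tracks its own reusable bytes
    reclaimable += s.indexDetails[idx]["block-manager"]["file bytes available for reuse"];
});
print("bytes potentially reclaimable by compact: " + reclaimable);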

 

So, with that in mind, we started compaction on the secondary by running this command:

db.runCommand({compact: "collectionX"})

The secondary's replication status at that moment:

source: host04:27017 syncedTo: Fri Jul 16 2021 00:23:20 GMT+0300 (MSK)
45 secs (0.01 hrs) behind the primary
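
While the compact was running, the node's state could also be confirmed from another connection; a minimal sketch (not something we ran verbatim, just the standard shell helpers):

rs.status().members.filter(function (m) { return m.self; })[0].stateStr;    // expected to report "RECOVERING" during compact
db.currentOp({"command.compact": {$exists: true}});                         // the in-flight compact operation, if any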

 

At almost exactly the same time we saw these messages in the mongod.log file:

{"t":{"$date":"2021-07-16T00:23:20.875+03:00"},"s":"I", "c":"REPL", "id":21350, "ctx":"conn3866106","msg":"Going into maintenance mode","attr":{"otherMaintenanceModeTasksInProgress":0}} 
 
{"t":{"$date":"2021-07-16T00:23:20.876+03:00"},"s":"I", "c":"REPL", "id":21358, "ctx":"conn3866106","msg":"Replica set state transition","attr":{"newState":"RECOVERING","oldState":"SECONDARY"}} 
 
{"t":{"$date":"2021-07-16T00:23:20.911+03:00"},"s":"I", "c":"COMMAND", "id":20284, "ctx":"conn3866106","msg":"Compact begin","attr":{"namespace":"collectionX"}}

 

That's exactly when compaction started, and then:
......

{"t":{"$date":"2021-07-17T14:01:15.957+03:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn3866106","msg":"WiredTiger message","attr":{"message":"[1626519675:957490][104617:0x7fcaf60e0700], file:index-collectionX.wt, WT_SESSION.compact: [WT_VERB_COMPACT_PROGRESS] Compact running for 2253 seconds; reviewed 787078 pages, skipped 638436 pages, wrote 148642 pages"}}
 
{"t":{"$date":"2021-07-17T14:03:33.200+03:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn3866106","msg":"WiredTiger message","attr":{"message":"[1626519813:200226][104617:0x7fcaf60e0700], WT_SESSION.compact: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 25385578866, snapshot max: 25385578866 snapshot count: 0, oldest timestamp: (1626384195, 1628) , meta checkpoint timestamp: (1626384200, 1628)"}}

......

 
{"t":{"$date":"2021-07-17T14:05:53.371+03:00"},"s":"I", "c":"STORAGE", "id":20286, "ctx":"conn3866106","msg":"Compact end","attr":{"namespace":"collectionX","freedBytes":26708127744}}

 

It took something like 38 hours to get the job done for this collection and its indexes, plus an additional 25 hours to catch up with the primary.

We didn't see any anomalies or spikes in CPU, memory, or disk usage during compaction, just some asserts in our monitoring, but nothing critical, especially compared to average workday metrics.

As for the result, it's pretty impressive: the final on-disk size of our collection is 3204157440 bytes (about 4 times smaller than before compaction).

But the main question is: is that a normal duration for the compaction procedure for a collection of this size (~13 GB on disk)?

How can we estimate how long a compaction might take?

Maybe we can do it based on some objective inputs: average load metrics, total collection size on disk, or something similar?

We were somewhat frustrated that a 38-hour compaction was needed for such a small amount of disk space.
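
For context, the only objective input we could find is the per-file "file bytes available for reuse" counter; a rough sketch (our own helper, nothing official) that surveys it across a database looks like the following, though it only says how much space compact could reclaim, not how long it would take:

var totals = {};
db.getCollectionInfos({type: "collection"}).forEach(function (c) {
    var s = db.getCollection(c.name).stats();
    if (s.wiredTiger && s.wiredTiger["block-manager"]) {
        // reclaimable bytes in the collection's data file (indexes not included here)
        totals[c.name] = s.wiredTiger["block-manager"]["file bytes available for reuse"];
    }
});
printjson(totals);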



 Comments   
Comment by Dmitry Agranat [ 19/Sep/21 ]

Thank you for the latest data haltandcatchfire91@gmail.com. We're assigning this ticket to the appropriate team to be further investigated based on our findings so far. Updates will be posted on this ticket as they happen.

Comment by Basil Markov [ 15/Sep/21 ]

Hi Dmitry Agranat.
I've just uploaded the collStats output as a .json file; I hope it helps with investigating the problem.
One note: this collStats data is from the collection before the compaction procedure.

Comment by Dmitry Agranat [ 13/Sep/21 ]

Hi haltandcatchfire91@gmail.com, for completeness, could you upload the output from collStats for the irkkt.REG.checks_schedule collection? I'd like to review storage stats including all the indexes.

Comment by Basil Markov [ 10/Sep/21 ]

Hi Dmitry Agranat. 

Is there any update about FTDC and the compaction procedure?

Comment by Dmitry Agranat [ 05/Sep/21 ]

Thanks haltandcatchfire91@gmail.com, I am looking at the provided data and cannot see anything unusual that might indicate an issue. However, because the compact is done in the "maintenance mode", we also do not collect any FTDC. Having visibility into some FTDC metrics might have helped us to spot any potential inefficiency.

I am going to investigate this a bit further to see if we can turn on diagnostic data gathering during the compact command.

Comment by Basil Markov [ 01/Sep/21 ]

Hi Dmitry Agranat
I've finally uploaded all the necessary data (diagnostic data, log files, etc.).
Here is the relevant information about our second compaction attempt.
We restored the collection from the latest dump taken before the first compaction and used it again, so the collection was the same.

SECONDARY> db.runCommand({compact: "REG.checks_schedule"})
{ "bytesFreed" : 5718319104, "ok" : 1 }

 
{"t":{"$date":"2021-08-26T10:27:37.729+03:00"},"s":"I", "c":"COMMAND", "id":20284, "ctx":"conn47","msg":"Compact begin","attr":{"namespace":"irkkt.REG.checks_schedule"}} 
... 
{"t":{"$date":"2021-08-26T21:25:25.540+03:00"},"s":"I", "c":"STORAGE", "id":20286, "ctx":"conn47","msg":"Compact end","attr":{"namespace":"irkkt.REG.checks_schedule","freedBytes":5718319104}} 
{"t":{"$date":"2021-08-26T21:25:25.540+03:00"},"s":"I", "c":"REPL", "id":21351, "ctx":"conn47","msg":"Leaving maintenance mode","attr":{"otherMaintenanceModeTasksOngoing":0}}

Comment by Dmitry Agranat [ 01/Sep/21 ]

Hi haltandcatchfire91@gmail.com, here is the new uploader link. Please add all the relevant information (similar to what you did before, "file bytes available for reuse" before compact, name of the collection, timestamp of the compact command start and end) in the next comment once the data is uploaded.

Comment by Basil Markov [ 01/Sep/21 ]

Dmitry Agranat, can you please share a new upload link and reopen this issue, just to avoid creating duplicates?

Comment by Basil Markov [ 30/Aug/21 ]

Hi!
I finally have all data from our latest compaction procedure.
Can you please share a new upload link?

Comment by Dmitry Agranat [ 29/Aug/21 ]

Hi haltandcatchfire91@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Basil Markov [ 12/Aug/21 ]

Hi!
That's still an issue for our team, but our last diagnostic data has already been rotated, so we are waiting for a time window to run another compaction session.
I think I'll come back with a full set of data and fresh examples by the end of next working week.

Comment by Dmitry Agranat [ 12/Aug/21 ]

Hi haltandcatchfire91@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please upload the requested data into this support uploader?

Thanks,
Dima

Comment by Dmitry Agranat [ 28/Jul/21 ]

Hi haltandcatchfire91@gmail.com, we would need some additional information in order to understand why it took 38 hours to compact this collection.

Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Regards,
Dima
