[SERVER-58819] Release of files after close_idle_time should not block the system Created: 26/Jul/21  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: John Moser Assignee: Backlog - Storage Engines Team
Resolution: Unresolved Votes: 0
Labels: refinement
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Engines
Participants:
Story Points: 8

 Description   

Setup: 200'000 collection and index files, on HDD.

After close_idle_time elapses, all the open files appear to be closed at once, which causes the liveness and readiness probes to fail.

Request: could the closing be done in a "softer" way, so that other processes on the OS that need to access the disk are not blocked?

 



 Comments   
Comment by Dmitry Agranat [ 29/Jul/21 ]

Thank you jamoser42@gmail.com for opening a new feature request. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen.

Comment by John Moser [ 26/Jul/21 ]

This ticket is closely related to https://jira.mongodb.org/browse/SERVER-58818

So at startup all the files are read, and then after some idle time about 200k minus 250 files are released more or less at once. This puts heavy stress on the system's I/O, so that, for example, the liveness and readiness probes can fail. In the worst case, when running on Kubernetes, the mongod pods get restarted.

IMHO, manual adjustments like the above would not be acceptable in a production environment.

If there were just a sleep(10ms) or so between each close, it could help keep the system more responsive during such an event. Or, even better, do not read all the files at startup ...
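
To make the suggestion above concrete, here is a minimal sketch of paced closing in JavaScript. It is illustrative only and is not WiredTiger code: `closeGradually`, `chunk`, `closeFn`, `batchSize` and `pauseMs` are hypothetical names chosen for this example. Note that pausing 10 ms after every single close would add 200,000 x 10 ms = 2,000 s (about 33 minutes) for 200k files, so the sketch pauses once per batch instead.

```javascript
// Illustrative sketch only; none of these names exist in MongoDB or WiredTiger.

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Close handles batch by batch, pausing between batches so that other
// disk users get a chance to run. `closeFn` is a caller-supplied callback
// (an assumption of this sketch) that closes one handle.
async function closeGradually(handles, closeFn, { batchSize = 100, pauseMs = 10 } = {}) {
  const batches = chunk(handles, batchSize);
  let closed = 0;
  for (let b = 0; b < batches.length; b++) {
    for (const handle of batches[b]) {
      await closeFn(handle);
      closed += 1;
    }
    // No pause is needed after the final batch.
    if (b < batches.length - 1) {
      await sleep(pauseMs);
    }
  }
  return closed;
}
```

With the defaults assumed here (100 handles per batch, 10 ms pause), 200k handles would add roughly 2,000 pauses, or about 20 s of deliberate idle time in total.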

Comment by Dmitry Agranat [ 26/Jul/21 ]

Hi jamoser42@gmail.com,

If all (or most of) the 200k tables are being closed, it means that your application has a large number of tables (a table is either a collection or an index) but regularly accesses only a relatively small proportion of them. Instead of waiting for the default 100000 seconds while accumulating hundreds of thousands of open tables, you can try closing them more gradually with the following parameter:

db.adminCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "file_manager=(close_idle_time=30,close_scan_interval=30,close_handle_minimum=100)" })

Please note that the default is file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250)

Regards,
Dima
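
The configuration string in the comment above is easy to mistype, so a small helper can assemble it. The following is a sketch under stated assumptions: `buildFileManagerConfig` is a hypothetical helper, not part of MongoDB or mongosh, and it only covers the three `file_manager` keys mentioned in this ticket.

```javascript
// Hypothetical helper (not a MongoDB API): builds the WiredTiger
// file_manager configuration string from a plain object.
function buildFileManagerConfig(options) {
  // Only the three keys discussed in this ticket are handled.
  const allowedKeys = ["close_idle_time", "close_scan_interval", "close_handle_minimum"];
  const parts = [];
  for (const key of allowedKeys) {
    const value = options[key];
    if (value === undefined) continue; // leave unspecified keys at their defaults
    if (!Number.isInteger(value) || value < 0) {
      throw new RangeError(key + " must be a non-negative integer");
    }
    parts.push(key + "=" + value);
  }
  return "file_manager=(" + parts.join(",") + ")";
}

// The values suggested in the comment above.
const config = buildFileManagerConfig({
  close_idle_time: 30,
  close_scan_interval: 30,
  close_handle_minimum: 100,
});

// In mongosh, the resulting string would be passed as:
// db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: config });
```

The `db.adminCommand` call is left as a comment so that the snippet also runs outside the shell.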

Generated at Thu Feb 08 05:45:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.