[SERVER-58818] Provide lazy loading feature in file manager Created: 24/Jul/21  Updated: 27/Oct/23  Resolved: 28/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: John Moser Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2021-07-27-23-07-17-944.png    
Participants:

 Description   

Provide lazy loading of the collection and index files at startup.

If there are on the order of hundreds of thousands of collections and indexes, memory consumption at startup is very high compared to runtime, when possibly only a few collections/indexes are actually used.

Furthermore, reading the metadata of each file takes a considerable amount of time.



 Comments   
Comment by John Moser [ 27/Jul/21 ]

Please close this ticket. Thanks.

Comment by John Moser [ 27/Jul/21 ]

@Daniel - 4.4.3 behaves as I have described: all the files are being read and we see the memory situation shown in 1.

I've reconfigured 4.4.6 again and, voilà, the file_manager parameters seem to be accepted and working at startup; see 2.

(See the attached screenshot image-2021-07-27-23-07-17-944.png.)

Sorry - it seems there was a configuration error on our side. Just want to say thank you to all.

Comment by Daniel Gottlieb (Inactive) [ 27/Jul/21 ]

jamoser42@gmail.com thank you for describing how you were observing that. In the simple scenario (no recovery necessary) on 4.4.6, I was unable to reproduce all files being opened at startup. I'm defining "startup" as everything that happens after starting the mongod process until the process is ready to accept connections. However, on 4.4.3, 4.4.4 and 4.4.5 I did* observe many more files being opened: all user collection files were opened, but not the corresponding indexes. I believe WT-7442 is the reason 4.4.6 produced a better result for me. For posterity, the output (on my constrained set of data) when using 4.4.6:

$ ls -ltr /proc/32457/fd | egrep collection\|index
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 49 -> /home/dgottlieb/data/db/index-20--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 38 -> /home/dgottlieb/data/db/index-1--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 37 -> /home/dgottlieb/data/db/collection-0--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 33 -> /home/dgottlieb/data/db/collection-10--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 32 -> /home/dgottlieb/data/db/collection-12--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 31 -> /home/dgottlieb/data/db/index-13--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 30 -> /home/dgottlieb/data/db/collection-15--72963723002566138.wt
lrwx------ 1 dgottlieb dgottlieb 64 Jul 27 11:16 24 -> /home/dgottlieb/data/db/collection-14--72963723002566138.wt
$ ls -ltr /data/db/ | egrep collection\|index
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-38--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-37--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-7--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-5--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-33--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-32--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-30--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-28--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-27--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-26--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 index-22--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-18--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-16--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:03 index-11--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:03 collection-21--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:07 collection-4--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  36864 Jun  8 21:07 collection-2--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:07 collection-31--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:07 index-9--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:07 index-3--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:07 collection-25--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:07 collection-24--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:07 collection-19--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:07 collection-17--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jun  8 21:07 collection-23--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:07 collection-34--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:08 index-35--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:08 index-36--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jun  8 21:11 index-29--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jul  8 13:52 collection-8--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  36864 Jul  8 14:28 collection-6--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jul 26 13:00 index-2-7491966967589730054.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jul 26 13:00 index-1-7491966967589730054.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jul 26 13:00 collection-0-7491966967589730054.wt
-rw------- 1 dgottlieb dgottlieb  36864 Jul 27 11:16 collection-14--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  36864 Jul 27 11:16 collection-15--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jul 27 11:16 index-13--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jul 27 11:16 collection-12--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  20480 Jul 27 11:16 collection-10--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb   4096 Jul 27 11:16 index-20--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  36864 Jul 27 11:17 index-1--72963723002566138.wt
-rw------- 1 dgottlieb dgottlieb  45056 Jul 27 11:17 collection-0--72963723002566138.wt

>> That said, startup can* sometimes be expensive, but for reasons unrelated to the number of collections and indexes in a MongoDB instance

Using 4.4.3 and 4.4.6 and with extensive testing, I can say this statement is wrong. If you work with a handful of collections and indexes, then you will not notice it. But use numbers like 100k, 200k, 400k - then you will see that memory usage and startup time are related to the number of files.

I'm sorry you felt that statement was wrong. I believe you've taken it out of context. In case it wasn't clear, I didn't intend to dismiss your claim. I absolutely believe you are seeing all files being opened at startup. My intention was simply to isolate which part of startup is responsible. Allow me to explain.

There are four major components to startup:

  • The WiredTiger storage engine
  • The MongoDB storage integration layer
  • Replication
  • Sharding (if applicable)

WiredTiger went through significant changes in 4.4 to reduce latencies when a lot of data is involved. This resulted in a more complicated recovery procedure. My current understanding is that the recovery process only needs to open a large number of files when there was an unclean shutdown, but there's a lot of nuance and detail here that I'm not familiar with.

The MongoDB storage integration layer is what was responsible for opening the files relevant in SERVER-43664. I also believe this layer opens additional files, scaling with the number of index builds that were running when the process was shut down (clean or unclean).

The replication layer is responsible for replaying oplog entries to ensure the node presents the same data to queries as it did when it shut down. The number of files touched here is proportional to the number of collections that are having data replayed and the number of indexes on those collections. The number of operations that are replayed depends largely on two factors: how much time has passed since the last "checkpoint", and the "majority point" of the replica set. A crash is typically bounded by needing to replay about one minute's worth of oplog. A replica set without a majority of its members healthy can accept an unbounded number of writes (with backpressure) that will have to be replayed.

The comment you quoted above was specifically targeted at isolating this case. As of 4.4.0 the cost of recovery where the "majority point" lags no longer has to be paid (unless there's eventually a replication rollback); it can be improved further, but that requires a good deal of work. My apologies for speculatively suggesting this may be what you are observing.

I'm less intimately familiar with sharding's recovery process, but I like to think it should have a small footprint. I'd be glad to be wrong, though; that would mean the problem is easier to identify. Historically, however, the prime example of sharding recovery is reconciling the results of an ongoing chunk migration when a process shut down. Technically speaking, I also believe that recovery process happens after a node is talking on the network (or is at least concurrent with the rest of startup), so it shouldn't prolong the time for a node to become available.

Generally speaking, some of these cases we are aware of and some we may not have good clarity on. Likewise, some are simply wasteful and we have plans to address them, while others are more fundamental. Fortunately, the fundamental costs of startup are typically only paid after a crash. My goal in isolating which component is responsible is to make sure the right team is able to classify your observation into those buckets. I'd hate to say "we'll fix this" and reduce unnecessary file handle usage on startup for a different scenario than the one that's affecting you.

You referred to extensive testing. If that was in the form of experimental scripts that you can share, we'd love to run those to guarantee we're identifying the problem you're observing. Otherwise, the mongod logs and FTDC data showing where you made observations on both 4.4.3 and 4.4.6 would go a long way to giving us the context necessary to reproduce this issue. Do you have any of the above that you are able to share?
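As a pointer for gathering that: FTDC data normally lives in the diagnostic.data directory inside the dbpath, so a minimal sketch for packaging it together with the logs (the paths below are examples, not your actual ones) would be:

$ # bundle the FTDC directory and the mongod log for upload
$ tar czf ftdc-and-logs.tgz /data/db/diagnostic.data /var/log/mongodb/mongod.log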

Comment by John Moser [ 27/Jul/21 ]

@Daniel

>> That said, startup can* sometimes be expensive, but for reasons unrelated to the number of collections and indexes in a MongoDB instance

Using 4.4.3 and 4.4.6 and with extensive testing, I can say this statement is wrong. If you work with a handful of collections and indexes, then you will not notice it. But use numbers like 100k, 200k, 400k - then you will see that memory usage and startup time are related to the number of files.
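A minimal sketch of that kind of reproduction, creating a large number of empty collections before restarting mongod (the database name, collection count and shell invocation are illustrative, not the exact test that was run):

$ # create 100k empty collections in a throwaway "loadtest" database
$ mongo --quiet --eval 'for (var i = 0; i < 100000; i++) { db.getSiblingDB("loadtest").createCollection("c" + i); }'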

 

Comment by John Moser [ 27/Jul/21 ]

@Daniel

  • ps aux -> find the PID of the mongod process
  • cd /proc/<pid>/fd
  • ls -ltr | grep data | wc -l

then you will get the number of open file handles for collections and indexes (a combined one-liner sketch is shown below). So it is very clear: all the collection and index files are opened.
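Combined into a single sketch, assuming a single mongod process on the host (pgrep and the egrep pattern are just one way to express the same steps):

$ pid=$(pgrep -x mongod)
$ ls -l /proc/$pid/fd | egrep 'collection-|index-' | wc -l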

I think https://jira.mongodb.org/browse/SERVER-43664 helped a lot to speed up the startup time. But IMO it would be even better to have the option of not loading all the files, as explained above and in https://jira.mongodb.org/browse/SERVER-58819.

Comment by Daniel Gottlieb (Inactive) [ 26/Jul/21 ]

jamoser42@gmail.com pertaining to the following claim, can you share any more specifics about what you are observing?

As far as I understand, at startup >all< the files are read.

Prior to SERVER-25025 (released on all 4.4 versions and backported to 3.6), I'm aware that every collection file would be opened to perform a single read. But I otherwise cannot observe every file being read (in their entirety?) at startup, much less being opened. The following is a list of WiredTiger collection + index files that MongoDB is aware of, followed by an strace of open calls that target that same pattern:

dgottlieb@chimichurri ~/xgen/mongo3[master]$ catalog /data/db | egrep 'collection-|index-'
	{u'ident': u'collection-0--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-1--72963723002566138'},
	{u'ident': u'collection-2--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-3--72963723002566138'},
	{u'ident': u'collection-4--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-5--72963723002566138'},
	{u'ident': u'collection-6--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-7--72963723002566138'},
	{u'ident': u'collection-8--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-9--72963723002566138'},
	{u'ident': u'collection-10--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-11--72963723002566138'},
	{u'ident': u'collection-12--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-13--72963723002566138'},
	{u'ident': u'collection-14--72963723002566138',
	{u'ident': u'collection-15--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-16--72963723002566138'},
	{u'ident': u'collection-17--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-18--72963723002566138'},
	{u'ident': u'collection-19--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-20--72963723002566138'},
	{u'ident': u'collection-21--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-22--72963723002566138'},
	{u'ident': u'collection-23--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-26--72963723002566138'},
	{u'ident': u'collection-24--72963723002566138',
	 u'idxIdent': {u'TenantMigrationRecipientTTLIndex': u'index-29--72963723002566138',
	               u'_id_': u'index-27--72963723002566138'},
	{u'ident': u'collection-25--72963723002566138',
	 u'idxIdent': {u'TenantMigrationDonorTTLIndex': u'index-30--72963723002566138',
	               u'_id_': u'index-28--72963723002566138'},
	{u'ident': u'collection-31--72963723002566138',
	 u'idxIdent': {u'ExternalKeysTTLIndex': u'index-33--72963723002566138',
	               u'_id_': u'index-32--72963723002566138'},
	{u'ident': u'collection-34--72963723002566138',
	 u'idxIdent': {u'_id_': u'index-35--72963723002566138',
	               u'a_1': u'index-36--72963723002566138',
	               u'a_1_b_1': u'index-38--72963723002566138',
	               u'b_1': u'index-37--72963723002566138'},
	{u'ident': u'collection-0-7491966967589730054',
	 u'idxIdent': {u'_id_': u'index-1-7491966967589730054',
	               u'lsidTTLIndex': u'index-2-7491966967589730054'},
dgottlieb@chimichurri ~/xgen/mongo3[master]$ egrep 'collection-|index-' strace 
open("/data/db/collection-14--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 23
open("/data/db/collection-15--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 28
open("/data/db/index-13--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 29
open("/data/db/collection-12--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 30
open("/data/db/collection-10--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 31
open("/data/db/collection-0--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 35
open("/data/db/index-1--72963723002566138.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 36
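For anyone wanting to capture a similar trace, a rough sketch (the exact strace flags and dbpath used for the output above may have differed):

$ strace -f -e trace=open,openat -o strace mongod --dbpath /data/db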

That said, startup can* sometimes be expensive, but for reasons unrelated to the number of collections and indexes in a MongoDB instance. The main one that comes to mind is the recovery process for replica sets/shards. MongoDB will replay operations in the oplog. The number of operations to replay is roughly equivalent to how many writes in the system have failed to reach a majority of nodes in the replica set when the process was restarted.

Comment by John Moser [ 26/Jul/21 ]

>> What MongoDB version is this functionality requested for?

If you are still supporting MongoDB 4.4.x, then it would be nice to have it there.

>> Can you also clarify "when possibly only a few collections/indexes are used" or "to operate 8GB are sufficient"? On startup, we load all the data that was in use by the storage engine during shutdown, so it's unclear why using only a third (8 GB of 24 GB) of this data would be sufficient.

As far as I understand, at startup >all< the files are read. That causes memory usage to surge to, for example, 24GB (in our case). At some point most of the file handles are released (-> file_manager params). The problem is that we have to make 24GB available on the server, which is then never used again by mongod until the mongod/replica set is restarted. So after startup, once the file handles have been released by the file_manager settings, in our case we could run with only a "small" portion of those 24GB.
The solution would be something like this: provide another parameter in file_manager, such as lazy_load=true|false. If true, load the appropriate files (collections, indexes) only when the first request that needs them hits mongod. That first request will be significantly slower, but in our case that is acceptable.
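For reference, the file_manager settings referred to above are the existing WiredTiger knobs for closing idle data handles after startup. A minimal sketch of passing them through mongod (the values here are illustrative, not a recommendation) could look like:

$ mongod --dbpath /data/db \
    --wiredTigerEngineConfigString 'file_manager=(close_idle_time=600,close_scan_interval=10,close_handle_minimum=250)'

Note that these settings only release handles after they have been opened; the lazy_load=true|false parameter proposed above is hypothetical and does not exist today.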

In my opinion this feature would be appreciated by a lot of other users too.

Comment by Dmitry Agranat [ 26/Jul/21 ]

What MongoDB version is this functionality requested for? Can you also clarify "when possibly only a few collections/indexes are used" or "to operate 8GB are sufficient"? On startup, we load all the data that was in use by the storage engine during shutdown, so it's unclear why using only a third (8 GB of 24 GB) of this data would be sufficient.

Comment by John Moser [ 24/Jul/21 ]

PS: Just as an example:

  • reading all the files takes 24 GB of memory
  • 8 GB are sufficient to operate
    => 16 GB are wasted
    If you run multiple mongod instances as replica sets, this adds up quickly.