[SERVER-40703] Resource leakage + slowness after creating and deleting a few thousand DBs Created: 17/Apr/19  Updated: 06/Dec/22  Resolved: 26/Jun/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: assaf-oren Assignee: Backlog - Storage Engines Team
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows & Linux


Attachments: PNG File 3.6.11-30000_cycles-disk_storage-WiredTiger.png     PNG File Screen Shot 2019-04-18 at 4.39.37 PM.png     PNG File dhandles.png     PNG File heap.png     File test.js     File x-4110.tgz     File x-in-memory.tgz    
Issue Links:
Duplicate
duplicates SERVER-48213 database creation get slower for eac... Closed
Related
related to SERVER-40925 Investigate if checkpoints, sweep and... Backlog
related to WT-4513 Investigate improvements in session's... Closed
is related to SERVER-48213 database creation get slower for eac... Closed
Assigned Teams:
Storage Engines
Participants:
Story Points: 5

 Description   

Creating a new DB with collections and indexes and then deleting it becomes slow after a few thousand cycles.

The mongod process's memory keeps growing even though the DBs are deleted, so that only a single DB is present at any given time. At some point the cycles become slower and slower; this seems to start around ~3.5 GB of memory consumption.

After stopping the script, the mongod process keeps consuming a large amount of CPU and memory indefinitely.
If the script is run again, memory climbs even higher, and so does the cycle time.

 

See a simple script to reproduce the issue:

https://github.com/assaf-xm/mongodb-create-drop-dbs/blob/master/test.js
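
For reference, here is a minimal sketch of the kind of cycle the script performs, written against the Node.js driver. This is a reconstruction from the description and the log output below, not the actual test.js; the connection URL, DB name, and index specs are placeholders.

// One cycle: connect, create 10 collections with an index each, drop them, drop the DB.
const { MongoClient } = require('mongodb');

async function cycle(i) {
    console.log(new Date().toISOString(), 'start', i);
    const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URL
    console.log(new Date().toISOString(), 'connected', i);
    const db = client.db('testdb'); // placeholder DB name; only one DB exists at a time
    for (let j = 0; j < 10; j++) {
        const c = db.collection('c' + j);
        await c.createIndex({ a: 1, b: 1 }); // placeholder index spec
        await c.insertOne({ a: 1, b: 2 });
        console.log(new Date().toISOString(), 'collection', j);
    }
    for (let j = 0; j < 10; j++) {
        await db.collection('c' + j).drop();
        console.log(new Date().toISOString(), 'drop collection', j);
    }
    await db.dropDatabase();
    console.log(new Date().toISOString(), 'drop', i);
    await client.close();
}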

 

And results:

https://github.com/assaf-xm/mongodb-create-drop-dbs/blob/master/3.6.11-in-memory.png

https://github.com/assaf-xm/mongodb-create-drop-dbs/blob/master/3.6.11-regular.png

 

At the start, a cycle is fast (201 ms):

2019-04-17T12:11:25.645Z start 3
2019-04-17T12:11:25.654Z connected 3
2019-04-17T12:11:25.655Z collection 0
2019-04-17T12:11:25.672Z collection 1
2019-04-17T12:11:25.690Z collection 2
2019-04-17T12:11:25.706Z collection 3
2019-04-17T12:11:25.726Z collection 4
2019-04-17T12:11:25.742Z collection 5
2019-04-17T12:11:25.760Z collection 6
2019-04-17T12:11:25.779Z collection 7
2019-04-17T12:11:25.796Z collection 8
2019-04-17T12:11:25.812Z collection 9
2019-04-17T12:11:25.828Z drop collection 0
2019-04-17T12:11:25.830Z drop collection 1
2019-04-17T12:11:25.832Z drop collection 2
2019-04-17T12:11:25.833Z drop collection 3
2019-04-17T12:11:25.835Z drop collection 4
2019-04-17T12:11:25.836Z drop collection 5
2019-04-17T12:11:25.838Z drop collection 6
2019-04-17T12:11:25.840Z drop collection 7
2019-04-17T12:11:25.841Z drop collection 8
2019-04-17T12:11:25.843Z drop collection 9
2019-04-17T12:11:25.845Z drop 3
2019-04-17T12:11:25.846Z finish 3 | 201

 

But after 8,800 cycles, DB operations are much slower (685 ms):

2019-04-17T12:41:20.397Z start 8800
2019-04-17T12:41:20.402Z connected 8800
2019-04-17T12:41:20.402Z collection 0
2019-04-17T12:41:20.486Z collection 1
2019-04-17T12:41:20.557Z collection 2
2019-04-17T12:41:20.642Z collection 3
2019-04-17T12:41:20.716Z collection 4
2019-04-17T12:41:20.773Z collection 5
2019-04-17T12:41:20.831Z collection 6
2019-04-17T12:41:20.877Z collection 7
2019-04-17T12:41:20.916Z collection 8
2019-04-17T12:41:20.959Z collection 9
2019-04-17T12:41:20.999Z drop collection 0
2019-04-17T12:41:21.003Z drop collection 1
2019-04-17T12:41:21.009Z drop collection 2
2019-04-17T12:41:21.018Z drop collection 3
2019-04-17T12:41:21.027Z drop collection 4
2019-04-17T12:41:21.036Z drop collection 5
2019-04-17T12:41:21.042Z drop collection 6
2019-04-17T12:41:21.052Z drop collection 7
2019-04-17T12:41:21.061Z drop collection 8
2019-04-17T12:41:21.071Z drop collection 9
2019-04-17T12:41:21.080Z drop 8800
2019-04-17T12:41:21.082Z finish 8800 | 685

 



 Comments   
Comment by Brian Lane [ 26/Jun/20 ]

Thanks assaf@xmcyber.com - great to hear.

-Brian

Comment by assaf-oren [ 25/Jun/20 ]

Following our latest conversation with Kian Gogan, I checked this with mongo 4.2.3 and the issue doesn't seem to reproduce.

Therefore I think we can close this ticket.

Comment by assaf-oren [ 17/Feb/20 ]

I can see that this fix was delayed.

Is there any workaround, e.g. via server commands, to avoid having to restart the DB when memory gets high (and performance gets low)?

 

Comment by Bruce Lucas (Inactive) [ 25/Apr/19 ]

The issue reproduces on 4.1.10, and with WT (not just in-memory). Here's a run on 4.1.10, standalone, WT, cache size 0.2 GB, with the heap profiler enabled:

When the cache reached 95%, the "bytes not belonging to page images" were finally freed (evicted?); based on the behavior of the heap profile, these seem like likely contributors:

2019-04-25T16:53:55.096+0000 I  -        [ftdc] heapProfile stack307: { 0: "tc_calloc", 1: "__wt_calloc", 2: "__wt_update_alloc", 3: "__wt_row_modify", 4: "__wt_btcur_insert", 5: "0x5616119bbfef", 6: "__wt_metadata_update", 7: "0x5616119f363f", 8: "__wt_meta_ckptlist_set", 9: "0x561611948791", 10: "__wt_checkpoint_close", 11: "__wt_conn_dhandle_close", 12: "__wt_session_release_dhandle", 13: "0x5616119bd668", 14: "mongo::WiredTigerIndex::StandardBulkBuilder::~StandardBulkBuilder", 15: "mongo::AbstractIndexAccessMethod::commitBulk", 16: "mongo::MultiIndexBlock::dumpInsertsFromBulk", 17: "mongo::MultiIndexBlock::insertAllDocumentsInCollection", 18: "0x561611d48659", 19: "mongo::ErrmsgCommandDeprecated::run", 20: "mongo::BasicCommand::Invocation::run", 21: "0x561611a94d15", 22: "0x561611a9637c", 23: "0x561611a971fe", 24: "mongo::ServiceEntryPointCommon::handleRequest", 25: "mongo::ServiceEntryPointMongod::handleRequest", 26: "mongo::ServiceStateMachine::_processMessage", 27: "mongo::ServiceStateMachine::_runNextInGuard", 28: "0x561611a9059d", 29: "mongo::transport::ServiceExecutorSynchronous::schedule", 30: "mongo::ServiceStateMachine::_scheduleNextWithGuard", 31: "mongo::ServiceStateMachine::_sourceCallback", 32: "mongo::ServiceStateMachine::_sourceMessage", 33: "mongo::ServiceStateMachine::_runNextInGuard", 34: "0x561611a9059d", 35: "0x56161294253b", 36: "0x561612f6d3d4", 37: "0x7f1c25cb8dd5", 38: "clone" }
2019-04-25T16:53:53.227+0000 I  -        [ftdc] heapProfile stack245: { 0: "tc_calloc", 1: "__wt_calloc", 2: "__wt_update_alloc", 3: "__wt_row_modify", 4: "__wt_btcur_insert", 5: "0x5616119bbfef", 6: "__wt_metadata_insert", 7: "__wt_schema_create", 8: "0x56161191e930", 9: "__wt_schema_create", 10: "__wt_session_create", 11: "0x561611931a0c", 12: "mongo::WiredTigerKVEngine::makeTemporaryRecordStore", 13: "mongo::KVStorageEngine::makeTemporaryRecordStore", 14: "mongo::IndexBuildInterceptor::IndexBuildInterceptor", 15: "mongo::IndexCatalogImpl::IndexBuildBlock::init", 16: "mongo::MultiIndexBlock::init", 17: "0x561611d44d2a", 18: "0x561611d48d40", 19: "mongo::ErrmsgCommandDeprecated::run", 20: "mongo::BasicCommand::Invocation::run", 21: "0x561611a94d15", 22: "0x561611a9637c", 23: "0x561611a971fe", 24: "mongo::ServiceEntryPointCommon::handleRequest", 25: "mongo::ServiceEntryPointMongod::handleRequest", 26: "mongo::ServiceStateMachine::_processMessage", 27: "mongo::ServiceStateMachine::_runNextInGuard", 28: "0x561611a9059d", 29: "mongo::transport::ServiceExecutorSynchronous::schedule", 30: "mongo::ServiceStateMachine::_scheduleNextWithGuard", 31: "mongo::ServiceStateMachine::_sourceCallback", 32: "mongo::ServiceStateMachine::_sourceMessage", 33: "mongo::ServiceStateMachine::_runNextInGuard", 34: "0x561611a9059d", 35: "0x56161294253b", 36: "0x561612f6d3d4", 37: "0x7f1c25cb8dd5", 38: "clone" }
2019-04-25T16:53:53.122+0000 I  -        [ftdc] heapProfile stack206: { 0: "tc_calloc", 1: "__wt_calloc", 2: "__wt_update_alloc", 3: "__wt_row_modify", 4: "__wt_btcur_insert", 5: "0x5616119bbfef", 6: "__wt_metadata_insert", 7: "__wt_schema_create", 8: "0x56161191e930", 9: "__wt_schema_create", 10: "__wt_session_create", 11: "0x561611931a0c", 12: "mongo::WiredTigerIndex::Create", 13: "mongo::WiredTigerKVEngine::createGroupedSortedDataInterface", 14: "mongo::KVCollectionCatalogEntry::prepareForIndexBuild", 15: "mongo::IndexCatalogImpl::IndexBuildBlock::init", 16: "mongo::IndexCatalogImpl::createIndexOnEmptyCollection", 17: "mongo::DatabaseImpl::createCollection", 18: "0x561611d449c3", 19: "0x561611d48c81", 20: "mongo::ErrmsgCommandDeprecated::run", 21: "mongo::BasicCommand::Invocation::run", 22: "0x561611a94d15", 23: "0x561611a9637c", 24: "0x561611a971fe", 25: "mongo::ServiceEntryPointCommon::handleRequest", 26: "mongo::ServiceEntryPointMongod::handleRequest", 27: "mongo::ServiceStateMachine::_processMessage", 28: "mongo::ServiceStateMachine::_runNextInGuard", 29: "0x561611a9059d", 30: "mongo::transport::ServiceExecutorSynchronous::schedule", 31: "mongo::ServiceStateMachine::_scheduleNextWithGuard", 32: "mongo::ServiceStateMachine::_sourceCallback", 33: "mongo::ServiceStateMachine::_sourceMessage", 34: "mongo::ServiceStateMachine::_runNextInGuard", 35: "0x561611a9059d", 36: "0x56161294253b", 37: "0x561612f6d3d4", 38: "0x7f1c25cb8dd5", 39: "clone" }

This event however did not delete data handles, and application performance impact due to "dhandle lock application thread time waiting" continued growing.
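
(For anyone tracking the same symptom: that counter can be polled from the mongo shell. A minimal sketch, assuming the usual serverStatus().wiredTiger layout; the exact key names may differ between versions.)

// Poll the dhandle lock wait counter called out above (key names assumed; verify on your build).
var wtLock = db.serverStatus().wiredTiger['lock'];
print('dhandle lock application thread time waiting (usecs):',
      wtLock['dhandle lock application thread time waiting (usecs)']);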

Comment by Bruce Lucas (Inactive) [ 25/Apr/19 ]

Simple shell-based repro:

ncolls = 10
niter = 2000
 
function repro() {
    for (let i = 0; i < niter; i++) {
        if (i%100==0)
            print('iteration', i)
        for (let j = 0; j < ncolls; j++) {
            c = db['c'+j]
            c.createIndex({a: 1, b: 1, c: 1});
            c.createIndex({b: 1, a: 1, c: 1});
            c.createIndex({c: 1, b: 1, a: 1});
            c.createIndex({a: 1, c: 1, b: 1});
            c.createIndex({b: 1, c: 1, a: 1});
            c.createIndex({c: 1, a: 1, b: 1});
            c.insert({a: 1, b: 2, c: 3});
        }
        for (let j = 0; j < ncolls; j++) {
            c = db['c'+j]
            c.drop()
        }
    }
}

The following run uses the in-memory storage engine on 3.6.11, running standalone, with a cache size (--inMemorySizeGB) of 1 GB.
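
(A hedged sketch of how such a run can be started; the dbpath and port below are placeholders, and the actual command line used for the run above is not recorded in this ticket.)

// Start a standalone in-memory mongod, e.g.:
//   mongod --storageEngine inMemory --inMemorySizeGB 1 --dbpath /data/inmem --port 27017
// Then, from a mongo shell connected to it, paste the repro() function above and run:
repro()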

  • Even though the application drops all collections that it creates, we see constant growth in "data handles currently active" and "bytes not belonging to page images in the cache".
  • Table create and table drop calls closely match, so I think the problem is in the storage engine, not the integration layer.
  • At A, the cache reaches 5% dirty (due to "bytes not belonging to page images in the cache"), evictions begin, and performance drops. This is accounted for by "dhandle lock application thread time waiting".
  • After evictions begin at A, the profile shows ~1 thread constantly active in __evict_walk @evict_lru.c:1432 (highlighted in blue), which is this line of code:

    		/* Ignore non-btree handles, or handles that aren't open. */
    		if (dhandle->type != WT_DHANDLE_TYPE_BTREE ||
    		    !F_ISSET(dhandle, WT_DHANDLE_OPEN))
    			continue;
    

    Does this mean that the eviction server is spending a lot of its time skipping over handles that aren't open?

  • Shortly after A, "data handles currently active" begins to rise.
  • Data handles and bytes not belonging to page images never go down, even after the workload ends (see the serverStatus sketch below for watching both counters).
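
A minimal mongo shell sketch for watching the two counters above (metric paths assumed from the usual serverStatus().wiredTiger layout; adjust key names to your version):

// Print data handle count and non-page-image cache bytes every few seconds.
function pollWtStats(intervalMs) {
    while (true) {
        var wt = db.serverStatus().wiredTiger;
        print(new Date().toISOString(),
              'data handles currently active:',
              wt['data-handle']['connection data handles currently active'],
              '| bytes not belonging to page images:',
              wt['cache']['bytes not belonging to page images in the cache']);
        sleep(intervalMs); // mongo shell built-in
    }
}
// pollWtStats(5000)
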
Comment by assaf-oren [ 20/Apr/19 ]

I have also uploaded the diagnostic data from the last few days to the secure upload portal.

Comment by assaf-oren [ 20/Apr/19 ]

Hi,

 

I did one more run of the script, letting it run for hours (30,000 cycles), this time using disk storage. See the file '3.6.11-30000_cycles-disk_storage-WiredTiger.png'.

As you can see, the latency starts to grow after ~14,000 cycles, climbing up to 1100 ms, but it drops from time to time back to ~200 ms.

 

Thanks,

Assaf

Comment by Eric Sedor [ 19/Apr/19 ]

Yes Assaf, you can use this secure upload portal. Files uploaded here are visible only to MongoDB employees and deleted after some time.

Comment by assaf-oren [ 19/Apr/19 ]

Hi Eric,

To answer your question: the memory growth was seen with both the in-memory and the on-disk storage engines.

With the in-memory engine it is quicker to reach the point where a cycle's latency (creating and deleting a DB and its collections) starts to grow as well. Once it starts to grow it keeps growing, as you can see from the graphs, and the DB's CPU usage stays high even after the script is stopped.

(With the in-memory engine, creating and deleting DBs is ~7 times faster, so the issue reproduces more quickly.)

I will run the script again and provide you with the diagnostic.data, can you provide a secured private location for these files?

Thanks,

Assaf 

Comment by Eric Sedor [ 18/Apr/19 ]

Thanks for your report assaf-oren. Can you clarify: Are you reporting this unexpected memory growth using the in-memory storage engine?

We are testing your script internally, but it would also be helpful if you could archive (tar or zip) the $dbpath/diagnostic.data directory for your test cluster (described here) and attach it to this ticket.
