[SERVER-33845] Important memory leak in bulkWrite via mongo shell Created: 13/Mar/18  Updated: 27/Oct/23  Resolved: 18/Mar/18

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: 3.4.10, 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henri-Maxime Ducoulombier Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL

 Description   

I've been writing a script to walk a collection with an invalid (badly designed) data structure and rewrite it into another collection, and I am hitting a massive memory leak when running the script in the mongo shell.

Here is the sample script (I removed everything that is not required and renamed the variables):

var srcCollection = db.getCollection('source'),
    dstCollection = db.getCollection('destination'),
    updates = [],
    batchSize = 200,
    counter = 0,
    limit = 0,
    flush = true,
    bulkOptions = {"writeConcern": {"w": 1}, "ordered": false},
    cursor = srcCollection.find({}).batchSize(batchSize).noCursorTimeout(),
    c = null,
    sl = 0,
    ol = 0,
    cl = 0,
    newDoc = null,
    n1, n2, n3;
 
if (limit > 0)
    cursor.limit(limit);
 
if (flush) {
    dstCollection.drop();
    printjson('Collection ' + dstCollection + ' has been dropped before processing...');
}
 
cursor.forEach(function (doc) {
    // Copy source document
    newDoc = Object.assign({}, doc);
    // Remove obsolete keys
    delete newDoc['mbz'];
    delete newDoc['_id'];
 
    newDoc['mbz'] = {
        'field1': [],
        'field2': [],
        'field3': []
    };
 
    if (doc.mbz) {
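        // Flatten the nested mbz structure: each sds entry produces one item in field1,
        // its 'o' sub-entries go to field2 and its 'c' sub-entries go to field3.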
        doc.mbz.forEach(function (elt) {
            c = elt.c_id;
            sl = elt.sds.length;
 
            for (n1 = 0; n1 < sl; n1++) {
 
                ol = elt.sds[n1]['o'].length;
                cl = elt.sds[n1]['c'].length;
 
                newDoc['mbz']['field1'].push({
                    'c': c,
                    'd': elt.sds[n1].d,
                    'x': elt.sds[n1].s_id
                });
 
                for (n2 = 0; n2 < ol; n2++) {
                    newDoc['mbz']['field2'].push({
                        'c': c,
                        'd': elt.sds[n1]['o'][n2].d,
                        'x': elt.sds[n1].s_id
                    });
                }
 
                for (n3 = 0; n3 < cl; n3++) {
                    newDoc['mbz']['field3'].push({
                        'c': c,
                        'd': elt.sds[n1]['c'][n3].d,
                        'l': elt.sds[n1]['c'][n3].l_id,
                        'x': elt.sds[n1].s_id
                    });
                }
                
            }
        });
    }
 
    updates.push({
        'insertOne':{
            "document" : newDoc
        }
    });
 
    counter++;
 
    if (updates.length >= batchSize) {
        // I tried bulkWrite, insertMany and initializeUnorderedBulkOp too
        dstCollection.bulkWrite(updates, bulkOptions);
        printjson('-- ' + counter + ' documents transferred so far...');
        updates = [];
    }
});
 
if (updates.length > 0) {
    dstCollection.bulkWrite(updates, bulkOptions);
}
 
printjson('----- Total: ' + counter + ' documents transferred');

This is rather brute-force, but it is meant to run only once and had to be written quickly. The same logic also works perfectly (and faster) in Python with PyMongo 3.4 or 3.6.

The script leaks only when the bulkWrite operations are actually executed. If I comment out the dstCollection.bulkWrite(updates, bulkOptions); lines, no write is performed and there is no memory leak, even when the cursor is iterated all the way to the end.

The collection is rather small (16,000 documents), but the documents average about 21 KB in size (the source collection is about 540 MB; the destination collection built with Python is about 330 MB). Memory usage grows after each bulkWrite (roughly every 3 seconds), adding 15 to 25 MB to the "mongo" shell process (not the MongoDB server).



 Comments   
Comment by Dmitry Agranat [ 18/Mar/18 ]

Hi hmducoulombier@marketing1by1.com,

Glad to hear that adding a call to gc() in the forEach fixed this issue. This means it's not a memory leak, just an accumulation of memory that hasn't yet been garbage collected.

Regards,
Dima

Comment by Henri-Maxime Ducoulombier [ 18/Mar/18 ]

Hi Dmitry,

Just tried it and it seems to fix the memory leak issue.

I put a call to gc() right after the updates = []; that follows the bulkWrite inside the loop (see the sketch below).
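For reference, here is roughly what the flush branch of the loop looks like with that change (a minimal sketch, only the batch-flush part is shown):

if (updates.length >= batchSize) {
    dstCollection.bulkWrite(updates, bulkOptions);
    printjson('-- ' + counter + ' documents transferred so far...');
    updates = [];
    // ask the shell's JavaScript engine to collect the batch that was just flushed
    gc();
}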

Henri-Maxime

Comment by Dmitry Agranat [ 18/Mar/18 ]

Hi hmducoulombier@marketing1by1.com,

Thank you for the report.

Could you please check whether adding a call to gc() in the forEach fixes this issue?

Thanks,
Dima
