[SERVER-35642] We're seeing increasing memory usage on secondary members. Created: 18/Jun/18  Updated: 27/Oct/23  Resolved: 22/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ivan Filimonov Assignee: William Schultz (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File memory.png     PNG File mongodb-leak.png     PNG File screenshot-1.png     PNG File screenshot-2.png    
Issue Links:
Depends
depends on SERVER-35941 Don't maintain full stable optime can... Closed
Operating System: ALL
Participants:

 Description   

Basically, we're seeing the same symptoms as in https://jira.mongodb.org/browse/SERVER-26312 after one of the apps using MongoDB was updated. Memory usage on the primary is reasonable and stable; on the secondary it is too high and grows all the time. The only change in the app that I think could have caused it is that one of the connections in the updated app uses the secondaryPreferred read preference.

On the attached CloudWatch chart the blue line is the primary (master), orange is a secondary. The third node isn't monitored at the moment, but its resident memory usage (db.serverStatus().mem.resident) was around 2833 MB, which is very close to the other secondary (orange).
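For reference, a minimal sketch of how the per-member figures can be read in the mongo shell (serverStatus reports mem values in MB; the host field is only included here for labelling):

// Sketch: resident/virtual memory as reported by serverStatus (values in MB).
var s = db.serverStatus();
printjson({ host: s.host, residentMB: s.mem.resident, virtualMB: s.mem.virtual });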

Each node has 4 GB of memory allocated (t2.medium instances).
Below is the server build information from the primary (master) node; it looks very similar on the other nodes.

{ 
    "version" : "3.6.4", 
    "gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856", 
    "modules" : [    ], 
    "allocator" : "tcmalloc", 
    "javascriptEngine" : "mozjs", 
    "sysInfo" : "deprecated", 
    "versionArray" : [
        NumberInt(3), 
        NumberInt(6), 
        NumberInt(4), 
        NumberInt(0)
    ], 
    "openssl" : {
        "running" : "OpenSSL 1.0.0-fips 29 Mar 2010", 
        "compiled" : "OpenSSL 1.0.1e-fips 11 Feb 2013"
    }, 
    "buildEnvironment" : {
        "distmod" : "amazon", 
        "distarch" : "x86_64", 
        "cc" : "/opt/mongodbtoolchain/v2/bin/gcc: gcc (GCC) 5.4.0", 
        "ccflags" : "-fno-omit-frame-pointer -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -Werror -O2 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -Wno-missing-braces -fstack-protector-strong -fno-builtin-memcmp", 
        "cxx" : "/opt/mongodbtoolchain/v2/bin/g++: g++ (GCC) 5.4.0", 
        "cxxflags" : "-Woverloaded-virtual -Wno-maybe-uninitialized -std=c++14", 
        "linkflags" : "-pthread -Wl,-z,now -rdynamic -Wl,--fatal-warnings -fstack-protector-strong -fuse-ld=gold -Wl,--build-id -Wl,--hash-style=gnu -Wl,-z,noexecstack -Wl,--warn-execstack -Wl,-z,relro", 
        "target_arch" : "x86_64", 
        "target_os" : "linux"
    }, 
    "bits" : NumberInt(64), 
    "debug" : false, 
    "maxBsonObjectSize" : NumberInt(16777216), 
    "storageEngines" : [
        "devnull", 
        "ephemeralForTest", 
        "mmapv1", 
        "wiredTiger"
    ], 
    "ok" : 1.0, 
    "operationTime" : Timestamp(1529327365, 839), 
    "$clusterTime" : {
        "clusterTime" : Timestamp(1529327365, 839), 
        "signature" : {
            "hash" : BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="), 
            "keyId" : NumberLong(0)
        }
    }
} 
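(This output can be obtained with something along the lines of the following; a sketch, since the exact shell helper used isn't recorded here:)

// Sketch: fetch build information from any member.
db.adminCommand({ buildInfo: 1 })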

Hope this helps. Please let me know if you need more information.



 Comments   
Comment by Tess Avitabile (Inactive) [ 22/Aug/18 ]

Fixed by SERVER-35941

Comment by Ivan Filimonov [ 23/Jul/18 ]

Well, it's not bugging me anymore, so if you aren't going to fix it, then yeah. I'd suggest at least mentioning somewhere in the documentation that users need to update their databases if they were created with the old protocol version.

Comment by William Schultz (Inactive) [ 23/Jul/18 ]

sunrize531@gmail.com Can this ticket be closed? See SERVER-35821.

Comment by Bruce Lucas (Inactive) [ 28/Jun/18 ]

Thanks for the confirmation Ivan, and thank you for reporting this issue.

Comment by Ivan Filimonov [ 27/Jun/18 ]

Yeah this helped quite a bit.

Comment by Ivan Filimonov [ 26/Jun/18 ]

It's currently going down, slowly, but that's how it has been all along: it goes up, then down, then regains what it freed and gains a little more... Will update in a few hours.

Comment by Bruce Lucas (Inactive) [ 26/Jun/18 ]

I think it may, but I'm not certain. Are you able to see a change in your monitoring? If you would like to upload a fresh copy of diagnostic.data I can also take a look.

Comment by Ivan Filimonov [ 26/Jun/18 ]

Done. Will it free the excess memory after the protocol version change, or do I need to restart the member?

Comment by Bruce Lucas (Inactive) [ 26/Jun/18 ]

Hi Ivan, thanks for uploading the data. It shows an accumulation of about 1 GB of excess allocated memory over the course of about 6 days:

The heap profiler indicates that that memory is allocated by the following call stack:

2018-06-19T11:21:33.030+0100 I -        [ftdc] heapProfile stack566: { 0: "operator new[]", 1: "std::pair<std::_Rb_tree_iterator<mongo::repl::OpTime>, bool> std::_Rb_tree<mongo::repl::OpTime, mongo::repl::OpTime, std::_Identity<mongo::repl::OpTim...", 2: "mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock", 3: "mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward", 4: "0x5632b66a6e50", 5: "mongo::repl::SyncTail::oplogApplication", 6: "mongo::repl::RSDataSync::_run", 7: "0x5632b7bf9040", 8: "0x7fa760440de5", 9: "clone" }

The log file indicates that you are using protocol version 0 (pv0), and I suspect that the behavior you are seeing could be a result of that. Can you try upgrading to pv1 to see if this addresses the problem?
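For reference, a minimal sketch of checking and changing the replica set protocol version from the mongo shell, assuming a standard replica set reconfiguration is acceptable for your deployment (run on the primary):

// Sketch: check the current protocol version (the field may be absent on configs created as pv0).
rs.conf().protocolVersion
// Switch to pv1:
cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)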

Thanks,
Bruce

Comment by Ivan Filimonov [ 26/Jun/18 ]

@Bruce Lucas, it has just entered the alarm state again. Seems a little sooner than last time... Any updates on it?

Comment by Ivan Filimonov [ 25/Jun/18 ]

Hi

I have uploaded the logs and diagnostic data using the URL above. The files start with SERVER-35642-.
Let me know if you need anything else.

Ivan

Comment by Ivan Filimonov [ 20/Jun/18 ]

So, just as a quick update: a day after I restarted the node, it's clear that the issue is still present. I will update again when it hits the alarm.

Comment by Ivan Filimonov [ 19/Jun/18 ]

I was getting an error message when trying to start it with the profiler. It works fine now; I had been trying to restart the wrong instance, which was running an older version.

Comment by Ivan Filimonov [ 19/Jun/18 ]

Will do

Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ]

Thanks Ivan, you are correct; it looks like it's unrelated to the index builds.

Would you be willing to restart the mongod on node-e with the heap profiler enabled? To do this you would set the heapProfilingEnabled parameter at startup; for example, on the command line:

mongod --setParameter heapProfilingEnabled=true
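After the restart, you can confirm the parameter took effect with something like the following (a sketch, assuming the parameter is visible via getParameter on your build):

// Sketch: verify the heap profiler parameter is set on the restarted node.
db.adminCommand({ getParameter: 1, heapProfilingEnabled: 1 })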

Since this is a very slow increase in memory, we would want to run this way for several days to a week in order to get a good signal. After several days, please upload

  • the mongod log file since the restart
  • the contents of diagnostic.data

At this point I would suggest leaving the mongod running with the heap profiler enabled in case our analysis determines we need to continue to accumulate more data.

Thanks,
Bruce

Comment by Ivan Filimonov [ 18/Jun/18 ]

I checked it myself: createIndex was called on a fairly small (at that time) and very recently created collection, and it finished successfully. It also looks like the mongod process simply crashed right after that; it doesn't look like we restarted it.

Comment by Ivan Filimonov [ 18/Jun/18 ]

I've uploaded the log file; it doesn't seem it was ever rotated.

Comment by Ivan Filimonov [ 18/Jun/18 ]

@Bruce Lucas, I'm fairly sure all createIndex operations have now completed, as I have checked currentOp many times since; also, the script that runs them actually monitors currentOp and waits for them to finish before starting to process the next collection. But let me fetch the logs for you.
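A sketch of how such a check can look from the mongo shell (hypothetical; the fields reported for in-progress index builds vary between server versions):

// Sketch: list operations that look like in-progress index builds.
db.currentOp(true).inprog.filter(function (op) {
    return (op.command && op.command.createIndexes) || /Index Build/i.test(op.msg || "");
});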

Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ]

sunrize531@gmail.com, the data for node-e is showing a slow steady increase in memory utilization since about 06-06 08:50 (UTC), coinciding with some createIndexes commands which were executed on that node when it was briefly primary. The memory increase continued after mongod was restarted on 06-08, which makes me suspect it could be due to a very long-running createIndexes command. Can you please also upload the mongod log files for that node from 06-06 to the present, or all available log files if those are no longer available?

Thanks,
Bruce

Comment by Ivan Filimonov [ 18/Jun/18 ]

Should be uploaded now @Bruce Lucas

Comment by Bruce Lucas (Inactive) [ 18/Jun/18 ]

sunrize531@gmail.com, you can upload the files here.

Thanks,
Bruce

Comment by Ivan Filimonov [ 18/Jun/18 ]

@ramon.fernandez, I've got the files and they're rather large... can you give me a link to that secure upload portal of yours?

Comment by Ivan Filimonov [ 18/Jun/18 ]

Our system engineer isn't sure about the previous version on this cluster; it was either 3.0.12 or 3.2.8. Also, the symptoms didn't appear right after the MongoDB version update; they started after deploying a new app version. Probably also worth mentioning that it's a Node.js application, and the new version also uses a newer MongoDB driver than before (updated to 3.0.6 from 2.2.27).
Trying to get diagnostic.data for you now.

Comment by Ivan Filimonov [ 18/Jun/18 ]

tcmalloc data:
For node D (master):

 { 
    "generic" : {
        "current_allocated_bytes" : NumberLong(1159561776), 
        "heap_size" : NumberLong(1820450816)
    }, 
    "tcmalloc" : {
        "pageheap_free_bytes" : 104038400.0, 
        "pageheap_unmapped_bytes" : 365858816.0, 
        "max_total_thread_cache_bytes" : 517996544.0, 
        "current_total_thread_cache_bytes" : 38480368.0, 
        "total_free_bytes" : 190991824.0, 
        "central_cache_free_bytes" : 150174752.0, 
        "transfer_cache_free_bytes" : 2336704.0, 
        "thread_cache_free_bytes" : 38480368.0, 
        "aggressive_memory_decommit" : 0.0, 
        "pageheap_committed_bytes" : NumberLong(1454592000), 
        "pageheap_scavenge_count" : 2238029.0, 
        "pageheap_commit_count" : 3399131.0, 
        "pageheap_total_commit_bytes" : NumberLong(6883023511552), 
        "pageheap_decommit_count" : 2238029.0, 
        "pageheap_total_decommit_bytes" : NumberLong(6881568919552), 
        "pageheap_reserve_count" : 468.0, 
        "pageheap_total_reserve_bytes" : NumberLong(1820450816), 
        "spinlock_total_delay_ns" : NumberLong(1541250917), 
        "formattedString" : "------------------------------------------------\nMALLOC:     1159562352 ( 1105.8 MiB) Bytes in use by application\nMALLOC: +    104038400 (   99.2 MiB) Bytes in page heap freelist\nMALLOC: +    150174752 (  143.2 MiB) Bytes in central cache freelist\nMALLOC: +      2336704 (    2.2 MiB) Bytes in transfer cache freelist\nMALLOC: +     38479792 (   36.7 MiB) Bytes in thread cache freelists\nMALLOC: +     14180608 (   13.5 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: =   1468772608 ( 1400.7 MiB) Actual memory used (physical + swap)\nMALLOC: +    365858816 (  348.9 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: =   1834631424 ( 1749.6 MiB) Virtual address space used\nMALLOC:\nMALLOC:         157865              Spans in use\nMALLOC:            344              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
    }
}

For node E (secondary with the increased memory usage):

{ 
    "generic" : {
        "current_allocated_bytes" : NumberLong(2975454480), 
        "heap_size" : NumberLong(3225341952)
    }, 
    "tcmalloc" : {
        "pageheap_free_bytes" : 53035008.0, 
        "pageheap_unmapped_bytes" : 91459584.0, 
        "max_total_thread_cache_bytes" : 517996544.0, 
        "current_total_thread_cache_bytes" : 19182632.0, 
        "total_free_bytes" : 105392880.0, 
        "central_cache_free_bytes" : 81987080.0, 
        "transfer_cache_free_bytes" : 4223168.0, 
        "thread_cache_free_bytes" : 19182632.0, 
        "aggressive_memory_decommit" : 0.0, 
        "pageheap_committed_bytes" : NumberLong(3133882368), 
        "pageheap_scavenge_count" : 3073699.0, 
        "pageheap_commit_count" : 4511066.0, 
        "pageheap_total_commit_bytes" : NumberLong(15813582766080), 
        "pageheap_decommit_count" : 3073699.0, 
        "pageheap_total_decommit_bytes" : NumberLong(15810448883712), 
        "pageheap_reserve_count" : 319.0, 
        "pageheap_total_reserve_bytes" : NumberLong(3225341952), 
        "spinlock_total_delay_ns" : 933969301.0, 
        "formattedString" : "------------------------------------------------\nMALLOC:     2975455056 ( 2837.6 MiB) Bytes in use by application\nMALLOC: +     53035008 (   50.6 MiB) Bytes in page heap freelist\nMALLOC: +     81987080 (   78.2 MiB) Bytes in central cache freelist\nMALLOC: +      4223168 (    4.0 MiB) Bytes in transfer cache freelist\nMALLOC: +     19182056 (   18.3 MiB) Bytes in thread cache freelists\nMALLOC: +     25714944 (   24.5 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: =   3159597312 ( 3013.2 MiB) Actual memory used (physical + swap)\nMALLOC: +     91459584 (   87.2 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: =   3251056896 ( 3100.4 MiB) Virtual address space used\nMALLOC:\nMALLOC:         374865              Spans in use\nMALLOC:            245              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
    }
} 
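(For reference, these sections can be dumped from the shell with something like the following; a sketch that assumes the node runs with the tcmalloc allocator, as the buildInfo above indicates:)

// Sketch: allocator statistics exposed by serverStatus on tcmalloc builds.
printjson(db.serverStatus().tcmalloc)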

Comment by Ramon Fernandez Marina [ 18/Jun/18 ]

Hi sunrize531@gmail.com:

  • what version did you upgrade from?
  • will you please upload the contents of the diagnostic.data directory all your nodes so we can investigate?

Thanks,
Ramón.
