[SERVER-53708] Excess memory usage during shutdown in durable history tests Created: 12/Jan/21  Updated: 27/Oct/23  Resolved: 17/Oct/21

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: James O'Leary Assignee: Chenhao Qu
Resolution: Gone away Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-10-13 at 4.11.49 pm.png     PNG File Screen Shot 2021-10-13 at 4.12.46 pm.png     PNG File Screen Shot 2021-10-13 at 5.22.32 pm.png     PNG File mem.png    
Issue Links:
Backports
Depends
is depended on by WT-6924 Queue history store pages for urgent ... Closed
Related
related to SERVER-54472 Collect additional FTDC metrics durin... Closed
is related to WT-6300 Delay setting leak memory flag until ... Closed
Operating System: ALL
Backport Requested:
v5.1
Steps To Reproduce:

Use DSI locally to set up a perf cluster:

  1. On the iwc node, run the following command line:

     (cd src/genny && ./src/workloads/contrib/historystore/eMRCf_runner.sh -g 8 -n)
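
While the workload runs and the primary shuts down, the memory growth can be observed without waiting for an OOM. A minimal sketch, assuming a Linux host and that it is run alongside eMRCf_runner.sh on the same node (the 5-second poll interval is arbitrary):

#!/usr/bin/env python3
# Sample system-wide anonymous memory from /proc/meminfo while the workload
# runs and the primary shuts down, so the growth stays visible even where FTDC
# stops reporting. Assumptions: Linux host, run alongside eMRCf_runner.sh on
# the same node; the 5-second poll interval is arbitrary.
import time

def anon_gib():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("AnonPages:"):
                return int(line.split()[1]) / (1024 * 1024)  # kB -> GiB
    return 0.0

while True:
    print(f"{time.strftime('%H:%M:%S')} AnonPages {anon_gib():.1f} GiB")
    time.sleep(5)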

Sprint: Storage - Ra 2021-09-20, Storage - Ra 2021-10-04, Storage - Ra 2021-10-18
Participants:
Story Points: 3

 Description   

Executing the eMRCf_runner.sh tests with more than 7 growth iterations and enableMajorityReadConcern set to true results in a SIGABRT when shutting down the primary.

{"t":{"$date":"2021-01-10T00:14:48.681+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"conn239","msg":"WiredTiger error","attr":{"error":12,"message":"[1610237688:672274][43957:0x7ff5de9fd700], file:index-39-3882910392635076206.wt, txn rollback_to_stable: __wt_calloc, 52: memory allocation of 2848 bytes failed: Cannot allocate memory"}}
{"t":{"$date":"2021-01-10T00:14:48.695+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"conn239","msg":"WiredTiger error","attr":{"error":12,"message":"[1610237688:695405][43957:0x7ff5de9fd700], WT_CONNECTION.close: __wt_calloc, 52: memory allocation of 264400 bytes failed: Cannot allocate memory"}}
{"t":{"$date":"2021-01-10T00:14:48.697+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"conn239","msg":"WiredTiger error","attr":{"error":12,"message":"[1610237688:696255][43957:0x7ff5de9fd700], WT_CONNECTION.close: __conn_close, 1035: failure during close, disabling further writes: Cannot allocate memory"}}
{"t":{"$date":"2021-01-10T00:14:49.054+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"thread355","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1610237689:54366][43957:0x7ff60225f700], log-wrlsn-server: __log_wrlsn_server, 871: log wrlsn server error: WT_PANIC: WiredTiger library panic"}}
{"t":{"$date":"2021-01-10T00:14:49.057+00:00"},"s":"F",  "c":"-",        "id":23089,   "ctx":"thread355","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":537}}
{"t":{"$date":"2021-01-10T00:14:49.057+00:00"},"s":"F",  "c":"-",        "id":23090,   "ctx":"thread355","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2021-01-10T00:14:49.059+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"thread355","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2021-01-10T00:14:49.238+00:00"},"s":"E",  "c":"CONTROL",  "id":31430,   "ctx":"thread355","msg":"Error collecting stack trace","attr":{"error":"unw_get_proc_name(564559009DFA): unspecified (general) error\nunw_get_proc_name(56455900B1F9): unspecified (general) error\nunw_get_proc_name(564559006DF6): unspecified (general) error\nunw_get_proc_name(7FF60AE1A7E0): unspecified (general) error\nunw_get_proc_name(7FF60AA91B20): unspecified (general) error\nunw_get_proc_name(7FF60AA92FC8): unspecified (general) error\nunw_get_proc_name(5645568CCE5A): unspecified (general) error\nunw_get_proc_name(5645565CA91F): unspecified (general) error\nunw_get_proc_name(5645573CF68B): unspecified (general) error\nunw_get_proc_name(5645565E81BE): unspecified (general) error\nunw_get_proc_name(5645565D376B): unspecified (general) error\nunw_get_proc_name(7FF60AE1040B): unspecified (general) error\nunw_get_proc_name(7FF60AB4AF9F): unspecified (general) error\nunw_get_proc_name(564559009DFA): unspecified (general) error\nunw_get_proc_name(56455900B1F9): unspecified (general) error\nunw_get_proc_name(564559006DF6): unspecified (general) error\nunw_get_proc_name(7FF60AE1A7E0): unspecified (general) error\nunw_get_proc_name(7FF60AA91B20): unspecified (general) error\nunw_get_proc_name(7FF60AA92FC8): unspecified (general) error\nunw_get_proc_name(5645568CCE5A): unspecified (general) error\nunw_get_proc_name(5645565CA91F): unspecified (general) error\nunw_get_proc_name(5645573CF68B): unspecified (general) error\nunw_get_proc_name(5645565E81BE): unspecified (general) error\nunw_get_proc_name(5645565D376B): unspecified (general) error\nunw_get_proc_name(7FF60AE1040B): unspecified (general) error\nunw_get_proc_name(7FF60AB4AF9F): unspecified (general) error\n"}}
{"t":{"$date":"2021-01-10T00:14:49.238+00:00"},"s":"I",  "c":"CONTROL",  "id":31380,   "ctx":"thread355","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"564559009DFA","b":"56455568F000","o":"397ADFA"},{"a":"56455900B1F9","b":"56455568F000","o":"397C1F9","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"564559006DF6","b":"56455568F000","o":"3977DF6","s":"abruptQuit","s+":"66"},{"a":"7FF60AE1A7E0","b":"7FF60AE09000","o":"117E0"},{"a":"7FF60AA91B20","b":"7FF60AA5E000","o":"33B20","s":"gsignal","s+":"110"},{"a":"7FF60AA92FC8","b":"7FF60AA5E000","o":"34FC8","s":"abort","s+":"148"},{"a":"5645568CCE5A","b":"56455568F000","o":"123DE5A","s":"_ZN5mongo35fassertFailedWithStatusWithLocationEiRKNS_6StatusEPKcj","s+":"0"},{"a":"5645565CA91F","b":"56455568F000","o":"F3B91F"},{"a":"5645573CF68B","b":"56455568F000","o":"1D4068B"},{"a":"5645565E81BE","b":"56455568F000","o":"F591BE","s":"__wt_panic_func","s+":"B3"},{"a":"5645565D376B","b":"56455568F000","o":"F4476B"},{"a":"7FF60AE1040B","b":"7FF60AE09000","o":"740B"},{"a":"7FF60AB4AF9F","b":"7FF60AA5E000","o":"ECF9F","s":"clone","s+":"3F"}],"processInfo":{"mongodbVersion":"4.9.0-alpha-1214-g54e8728","gitVersion":"54e87285f802a49456c02c80cb0ddb0fbb54c88a","compiledModules":["enterprise"],"uname":{"sysname":"Linux","release":"4.14.165-131.185.amzn2.x86_64","version":"#1 SMP Wed Jan 15 14:19:56 UTC 2020","machine":"x86_64"},"somap":[{"b":"56455568F000","elfType":3,"buildId":"7FEA37A074A30C9B75AC461167B7598F4E171408"},{"b":"7FF60AE09000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"5E12AD1B2983169B48CE4C7051454E8CA38D6EF0"},{"b":"7FF60AA5E000","path":"/lib64/libc.so.6","elfType":3,"buildId":"DAFC6796B1B45E202959E8872844CABEE8A20A32"}]}}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"564559009DFA","b":"56455568F000","o":"397ADFA"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"56455900B1F9","b":"56455568F000","o":"397C1F9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"564559006DF6","b":"56455568F000","o":"3977DF6","s":"abruptQuit","s+":"66"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"7FF60AE1A7E0","b":"7FF60AE09000","o":"117E0"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"7FF60AA91B20","b":"7FF60AA5E000","o":"33B20","s":"gsignal","s+":"110"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"7FF60AA92FC8","b":"7FF60AA5E000","o":"34FC8","s":"abort","s+":"148"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"5645568CCE5A","b":"56455568F000","o":"123DE5A","s":"_ZN5mongo35fassertFailedWithStatusWithLocationEiRKNS_6StatusEPKcj","s+":"0"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"5645565CA91F","b":"56455568F000","o":"F3B91F"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"5645573CF68B","b":"56455568F000","o":"1D4068B"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"5645565E81BE","b":"56455568F000","o":"F591BE","s":"__wt_panic_func","s+":"B3"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"5645565D376B","b":"56455568F000","o":"F4476B"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"7FF60AE1040B","b":"7FF60AE09000","o":"740B"}}}
{"t":{"$date":"2021-01-10T00:14:49.239+00:00"},"s":"I",  "c":"CONTROL",  "id":31445,   "ctx":"thread355","msg":"Frame","attr":{"frame":{"a":"7FF60AB4AF9F","b":"7FF60AA5E000","o":"ECF9F","s":"clone","s+":"3F"}}}



 Comments   
Comment by Chenhao Qu [ 13/Oct/21 ]

Here's the cache fill ratio of the data:

Comment by Chenhao Qu [ 13/Oct/21 ]

I managed to run a single round of the test on my local dev box. I couldn't run more rounds because of an issue in genny. From the ftdc data I got for that run, I can't reproduce the issue. I have also looked at the code: the eviction server keeps running during shutdown, so I don't expect to see this excess memory usage in the current implementation. I suspect the issue has been fixed.

Comment by Alexander Gorrod [ 30/Apr/21 ]

This is important, but it isn't materially tied to a particular MongoDB release, so we are bumping the fix version. Once we have improved the behavior, we will backport any relevant changes to earlier releases where applicable.

Comment by Haseeb Bokhari (Inactive) [ 22/Feb/21 ]

We fixed a similar issue in WT-6300.

Comment by Keith Smith [ 10/Feb/21 ]

Yes, of course it is. I wasn't thinking clearly.

Comment by Bruce Lucas (Inactive) [ 10/Feb/21 ]

keith.smith, I think that's this ticket.

Comment by Bruce Lucas (Inactive) [ 10/Feb/21 ]

Actually it's not quite correct that we have no ftdc during shutdown - we do have some metrics. Here's the ftdc from the iterations-08 run of Jim's test attached to this ticket, showing system anon, which is the total heap for all processes on the machine, growing by about 30 GB during shutdown, compared to the 2 GB configured cache size and a 17 GB HS file.

Probably we should have a test for this excess memory usage that doesn't depend on OOM to detect it, especially since we're running the test on a machine with a lot more memory than the configured cache size.
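
A minimal sketch of such a check, assuming the test harness knows the mongod PID and the configured cache size (the watch_shutdown helper, the 3x factor, and the 1-second poll are all illustrative, not an existing hook):

# Sketch: fail fast if mongod's RSS exceeds a multiple of the configured cache
# size while it shuts down, instead of relying on the kernel OOM killer.
# Assumptions: Linux /proc, the harness supplies the mongod PID and the cache
# size; the 3x factor and 1-second poll interval are illustrative only.
import sys
import time

def rss_bytes(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # kB -> bytes
    return 0

def watch_shutdown(pid, cache_bytes, factor=3.0):
    limit = int(cache_bytes * factor)
    while True:
        try:
            rss = rss_bytes(pid)
        except FileNotFoundError:
            return  # process exited before exceeding the limit
        if rss > limit:
            sys.exit(f"mongod RSS {rss} bytes exceeded limit {limit} during shutdown")
        time.sleep(1)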

Comment by Keith Smith [ 10/Feb/21 ]

bruce.lucas, that's too bad, but your data trumps my intuition. Can you open a WT ticket for that?

Comment by Bruce Lucas (Inactive) [ 10/Feb/21 ]

So I have a test that shows a similar behavior: with a configured cache size of 2 GB and a large HS (about 10 GB) memory usage increases to about 50 GB during shutdown. This is not improved in 4.4.4-rc1, so I don't think WT-6924 fixes this problem.

Comment by Keith Smith [ 10/Feb/21 ]

bruce.lucas: Yes, with WT-6924 we still allow memory usage to spike, as long as it doesn't exceed our cache limits. Regarding behavior during RTS at shutdown, WT-6924 would help if we're seeing a lot of cache pressure caused by a large amount of history store content. Since RTS performs a lot of queries on the history store, this could come into play.

But maybe haseeb.bokhari can confirm my intuition. He did the work on WT-6924 and has also looked at WT-7079, which covers the long RTS time when recovering after the failures described in this ticket.

Comment by Bruce Lucas (Inactive) [ 10/Feb/21 ]

jim.oleary, this test was run on 01-10, which would have been before the fix for WT-6924 went into master on 01-22. Does this issue still occur on a version that includes WT-6924?

keith.smith, as I understand WT-6924 it relates to the spikes of memory usage during checkpoints. We do see such spikes in this data (although never more than 100% of cache, so not in danger of causing OOM). Also the out-of-memory failure here is occurring during rollback to stable on shutdown:

{"t":{"$date":"2021-01-10T00:14:48.681+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"conn239","msg":"WiredTiger error","attr":{"error":12,"message":"[1610237688:672274][43957:0x7ff5de9fd700], file:index-39-3882910392635076206.wt, txn rollback_to_stable: __wt_calloc, 52: memory allocation of 2848 bytes failed: Cannot allocate memory"}}

Would we expect that to be affected by WT-6924? (Unfortunately a problem occurred collecting the stack trace on abort, and we still don't have ftdc during shutdown, so we have limited information about the problem.)

Comment by Bruce Lucas (Inactive) [ 10/Feb/21 ]

I think a better description of the symptom is out of memory; the abort is just a way of terminating the server when that happens.

{"t":{"$date":"2021-01-10T00:14:48.695+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"conn239","msg":"WiredTiger error","attr":{"error":12,"message":"[1610237688:695405][43957:0x7ff5de9fd700], WT_CONNECTION.close: __wt_calloc, 52: memory allocation of 264400 bytes failed: Cannot allocate memory"}}

Comment by Keith Smith [ 12/Jan/21 ]

This is possibly a manifestation of WT-6924, where the WT cache grows far beyond its configured limit when making heavy use of the history store. By design, this eMRCf test stresses the history store a lot.
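
For what it's worth, that divergence is easy to watch while the workload runs by polling serverStatus; a sketch assuming pymongo and a local mongod (the connection string and poll interval are assumptions, and the stat names are the ones serverStatus exposes under wiredTiger.cache):

# Sketch: compare WT cache usage against its configured maximum via serverStatus.
# Assumptions: pymongo installed, mongod listening on localhost:27017; the
# 5-second poll interval is arbitrary.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    print(f"WT cache: {used / 2**30:.2f} GiB used / {limit / 2**30:.2f} GiB configured")
    time.sleep(5)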
