[SERVER-22482] Cache growing to 100% followed by crash Created: 05/Feb/16  Updated: 12/Feb/16  Resolved: 12/Feb/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mike Templeman Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive diagnostics.zip     PNG File ftdc.png     Text File shard2-crash.log    
Issue Links:
Duplicate
duplicates SERVER-20306 75% excess memory usage under WiredTi... Closed
duplicates SERVER-21988 Rollback does not wait for applier to... Closed
duplicates SERVER-22279 SubplanStage fails to register its Mu... Closed
Operating System: ALL
Steps To Reproduce:

Start the primary shard server during a busy day and wait a couple of hours.

Participants:

 Description   

The primary server on my primary shard has repeatedly run into crashes in production. Mongostat reports the cache used% growing to 100% (and sometimes 101%) and the dirty% to over 90%. Once this happens it is only a matter of time until the server crashes. The memory size and res figures do not grow to the point where the server would be crashing because it is out of memory.
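The used and dirty percentages that mongostat shows can also be read from serverStatus; the following is a minimal sketch, assuming pymongo and a mongod on the default local port, that approximates those two figures:

import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
while True:
    # serverStatus exposes the WiredTiger cache counters used here to
    # approximate mongostat's used% and dirty% columns.
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used_pct = 100.0 * cache["bytes currently in the cache"] / max_bytes
    dirty_pct = 100.0 * cache["tracked dirty bytes in the cache"] / max_bytes
    print("cache used %.1f%%  dirty %.1f%%" % (used_pct, dirty_pct))
    time.sleep(1)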

Opening the log file for either server (this happened to both the promoted secondary and the primary), I find thousands of lines with error messages like these:

2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5ECA4877FFFFEB27DF76BB90446AE2462) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4DA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 140402C0) is less than the previous key (3C0A3238324B204A4F45203A536F43616C4A4F4542000447951242), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F308EB0442B6EA52) is less than the previous key (2E4B485BC8877FFFFEAD526FF19B044B83B3EA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F3082E2A7DC42278800001464CDE2D0678800001464CDE2D060442B6EA52) is less than the previous key (2E4B485BC82E01D860A07880000152AC3892147880000152AC389214044B83D492), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F308877FFFFEB9B321D2F90442B6EA52) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4DA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 140402E8) is less than the previous key (3C0A3238324B204A4F45203A536F43616C4A4F4542000447951242), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F478EB04445670AA) is less than the previous key (2E4B485BC8877FFFFEAD526FF19B044B83B3EA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F4782D317A2E7880000149E9C7C8D27880000149E9C7C8D204445670AA) is less than the previous key (2E4B485BC82E01D860A07880000152AC3892147880000152AC389214044B83D492), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F478877FFFFEB61638372D04445670AA) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4DA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 14040300) is less than the previous key (3C0A3238324B204A4F45203A536F43616C4A4F4542000447951242), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA877FFFFEAD520BCC4D0445D35EFA) is less than the previous key (2E4B485BC8877FFFFEAD526FF19B044B83B3EA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA2D27E306788000014BFD075515788000014BFD07551504453ACEF2) is less than the previous key (2E4B485BC82E01D860A07880000152AC3892147880000152AC389214044B83D492), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA877FFFFEB06628460F044880E17A) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4DA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 14040308) is less than the previous key (3C0A3238324B204A4F45203A536F43616C4A4F4542000447951242), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA877FFFFEAD5271F85C04475D1322) is less than the previous key (2E4B485BC8877FFFFEAD526FF19B044B83B3EA), which is a bug.
2016-02-04T23:46:00.871+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA2E013EB10478800001497BF3CA7378800001497BF3CA73044431CE52) is less than the previous key (2E4B485BC82E01D860A07880000152AC3892147880000152AC389214044B83D492), which is a bug.
2016-02-04T23:46:00.870+0000 I STORAGE  [conn42] WTIndex::updatePosition -- the new key ( 2E15D5F5FAEB0442EEBC72) is less than the previous key (2E4B485BC8877FFFFEAD526FF19B044B83B41A), which is a bug.
2016-02-04T23:46:00.872+0000 I STORAGE  [conn40] WTIndex::updatePosition -- the new key ( 2E15D5F5FA877FFFFEB1AD52F4D804475D1322) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4DA), which is a bug.
2016-02-04T23:46:00.872+0000 I STORAGE  [conn42] WTIndex::updatePosition -- the new key ( 2E15D5F5FA2EC2DA5340788000014E52AD0B27788000014E52AD0B2704475D1322) is less than the previous key (2E4B485BC82E01DE84C27880000152AA3D0BAB7880000152AA3D0BAB044B8272B2), which is a bug.
2016-02-04T23:46:00.872+0000 I STORAGE  [conn42] WTIndex::updatePosition -- the new key ( 2E15D5F5FA877FFFFEB9156C2AD90442EEBC72) is less than the previous key (2E4B485BC8877FFFFEAD53C76DEB044B83D4EA), which is a bug.

The end of the log has no crash information at all.



 Comments   
Comment by Ramon Fernandez Marina [ 12/Feb/16 ]

Hi Mike,

The 3.2.2 release is currently scheduled for next week. The 3.2.2-rc2 release candidate should have the same bits.

Regards,
Ramón.

Comment by Michael Templeman [ 12/Feb/16 ]

Thanks Thomas.

Just as an FYI, the SERVER-22279 errors come in large sequences that quickly fill up the cache and essentially stop the server from processing until the cache is reduced, which can only happen once the errors stop. The only workaround I have found so far is to step the server down and restart it.
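A minimal sketch of that stepdown workaround, assuming pymongo and a placeholder host name (not taken from this ticket); replSetStepDown drops client connections, so the resulting exception is expected:

from pymongo import MongoClient
from pymongo.errors import AutoReconnect

# Placeholder host name; substitute the primary of the affected shard.
client = MongoClient("shard-primary.example.com", 27017)
try:
    # Ask the primary to step down for 60 seconds so a secondary can take over
    # before the mongod process is restarted out of band.
    client.admin.command("replSetStepDown", 60)
except AutoReconnect:
    # The primary closes all connections when it steps down, so this
    # exception signals success rather than failure.
    pass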

Do you have an expected date for releasing 3.2.2?

Mike Templeman
Head of Development

T: @talkingfrog1950 <http://twitter.com/missdestructo>
T: @Meshfire <http://twitter.com/meshfire>


Comment by Kelsey Schubert [ 12/Feb/16 ]

Hi miketempleman,

Thank you for opening this ticket. I'm sorry you have run into these issues. I'd like to summarize our findings:

  • SERVER-22279 corrects the WTIndex::updatePosition errors that you are experiencing
  • Additionally, SERVER-21988 fixes the Fatal assertion 28723 error that you experienced.
  • Finally, SERVER-20306 explains the 5 GB of free memory. Please watch SERVER-20306 for updates as we work to correct this issue.

Both SERVER-22279 and SERVER-21988 are included in the 3.2.2-rc2 release candidate, which is now available for download if you would like to test them.

Since we have identified the causes of this behavior, I will be closing this ticket as a duplicate of those tickets.

If you see WT cache growing significantly past its limit or if you experience any other issues that cannot be explained by these tickets, please do not hesitate to open a new ticket.

Thank you again,
Thomas

Comment by Michael Templeman [ 05/Feb/16 ]

Bruce

I will set the cache size on all instances to the default. I can see that the WT cache often grows beyond the configured amount, so using the default will not leave gigabytes of memory unused.

I am surprised by the number of connections in comparison to the other shards. I will need to dig into that to find out what is going on. Every application server has its own mongos server as well as its own connection pool. The primary shard will, of course, get more queries and updates, but 4X more than the other shards seems excessive.

Finally and most importantly, please accept my personal apology for my
unprofessional message. No excuse for it.

Mike

Mike Templeman
Head of Development

T: @talkingfrog1950 <http://twitter.com/missdestructo>
T: @Meshfire <http://twitter.com/meshfire>

Comment by Michael Templeman [ 05/Feb/16 ]

Ramon

Can you delete my last message? It was both unprofessional and unfair. No
excuse for it. Please extend my apologies to anyone on the teams who might
have encountered it.

Thanks

Mike Templeman
Head of Development

T: @talkingfrog1950 <http://twitter.com/missdestructo>
T: @Meshfire <http://twitter.com/meshfire>


Comment by Bruce Lucas (Inactive) [ 05/Feb/16 ]

Hi Mike,

We've reviewed the data and it appears that

  • we see frequent crashes after virtual memory usage grows to 32 GB; this is probably the OS OOM killer
  • the memory is accounted for by
    • about 25 GB in WT cache
    • around 5 GB of free memory; this is likely memory fragmentation, tracked by SERVER-20306
    • an additional ~2GB of virtual memory, probably related to the nearly 2k connections

Given this, it doesn't appear there are any additional memory-usage issues beyond those we are already aware of, and reducing the WT cache to its default value should eliminate the OOM crashes. (Note, by the way, that reducing the WT cache does not necessarily reduce performance; under some conditions it can even improve it, because the remaining memory is used by the kernel as a filesystem cache, which can provide an even greater benefit than the WT cache.)
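A rough back-of-envelope for this accounting, assuming the 3.2-era default WiredTiger cache of roughly 60% of RAM minus 1 GB (the formula and the overhead figures below are illustrative assumptions, not measurements from this ticket):

GB = 1024 ** 3
ram = 30.5 * GB                          # i2.xlarge instance memory

# Assumed 3.2 default: the larger of 60% of RAM minus 1 GB, or 1 GB.
default_cache = max(0.6 * ram - 1 * GB, 1 * GB)
explicit_cache = 24 * GB                 # --wiredTigerCacheSizeGB 24

for label, cache in (("default cache", default_cache), ("24 GB cache", explicit_cache)):
    # Add the ~5 GB of fragmentation/free memory and ~2 GB of connection
    # overhead observed above; illustrative only.
    total = cache + 5 * GB + 2 * GB
    print("%-13s -> roughly %.1f GB of %.1f GB RAM" % (label, total / GB, ram / GB))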

Comment by Ramon Fernandez Marina [ 05/Feb/16 ]

miketempleman, my colleagues in the Replication team tell me that the Fatal assertion 28723 error you're seeing in the logs has been fixed in SERVER-21988, and the fix is included in 3.2.2, so if you upgrade the affected node(s) to 3.2.2-rc0 the problem should go away.

I looked at the diagnostic data and it doesn't seem like there were any other problems going on, but I'd still recommend you consider using the default cache size for the WiredTiger engine.

I'd understand if your decision to move back to 3.0 was firm, but I'd still like to encourage you to give 3.2.2-rc0 a try.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 05/Feb/16 ]

Thanks for the additional information miketempleman. I'm sorry to hear that the experience with 3.2.1 hasn't been good so far, and I hope we can get to the bottom of these issues to make MongoDB 3.2 better. If you want to downgrade I'd recommend 3.0.9, the latest stable release in the 3.0 series.

The first thing I need to mention is that increasing the WiredTiger cache size is not recommended: this setting is a size limit for the WiredTiger cache only, and mongod will use additional memory on top of it, so increasing the cache size may reduce performance and increase the chances that the OOM killer will terminate mongod.

That being said, the log does show a fatal assertion that needs a closer look:

2016-02-05T03:20:44.691+0000 I REPL     [ReplicationExecutor] syncing from: s2r2.db.meshfire.com:27017
2016-02-05T03:20:44.693+0000 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to s2r2.db.meshfire.com:27017
2016-02-05T03:20:44.696+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to s2r2.db.meshfire.com:27017
2016-02-05T03:20:44.697+0000 I REPL     [rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 12, timestamp: Feb  5 01:56:40:9). source's GTE: (term: 13, timestamp: Feb  5 01:56:43:1) hashes: (-6186595901470975452/13223962844377986)
2016-02-05T03:20:44.698+0000 F REPL     [rsBackgroundSync] need to rollback, but in inconsistent state
2016-02-05T03:20:44.698+0000 I -        [rsBackgroundSync] Fatal assertion 28723 UnrecoverableRollbackError need to rollback, but in inconsistent state. minvalid: (term: 13, timestamp: Feb  5 01:56:45:86) > our last optime: (term: 12, timestamp: Feb  5 01:56:40:9) @ 18750
2016-02-05T03:20:44.698+0000 I -        [rsBackgroundSync] 
 
***aborting after fassert() failure

I'm also looking at the diagnostics data, please continue to watch this ticket for updates.

Thanks,
Ramón.

Comment by Mike Templeman [ 05/Feb/16 ]

Log of failed restart on secondary due to oplog inconsistencies

Comment by Mike Templeman [ 05/Feb/16 ]

files in <data>/diagnostic.data/

Comment by Mike Templeman [ 05/Feb/16 ]

Configuration is an AWS EC2 i2.xlarge instance (4 virtual CPUs, 30.5 GB RAM, 800 GB ephemeral SSD) running AWS Linux (ami-1ecae776). The data is stored on the ephemeral drive.

Startup options are: --storageEngine wiredTiger --wiredTigerCacheSizeGB 24 --logpath xxxxx --oplogSize 90000 --fork --replSet xxxxx

The log file of the server after restart is for a secondary. Are you sure you want that log?

Comment by Ramon Fernandez Marina [ 05/Feb/16 ]

miketempleman, I'm not sure the behavior you describe (cache growth and crash) is related to the messages you're seeing in the logs.

The WTIndex::updatePosition errors you're seeing look like a manifestation of SERVER-22279, which was fixed recently in 3.2.2. A 3.2.2-rc0 release candidate was released today and it's available for download, so I'd like to ask you to test it to address that issue.

As for the cache growth and crash, please send us the following information so we can investigate further:

  • full logs for the affected node(s) since the last restart
  • the contents of the <dbpath>/diagnostic.data directory
  • the configuration details for the machine hosting this mongod (CPU, available RAM, etc.)
  • the contents of your mongod.conf file, or a list of configuration options used to run this mongod

If the end of the log has no information at all, it's possible that the mongod process was killed by the OS, which can happen if a bug allows the cache to grow past its configured limit. The information requested above should point us in the right direction.
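If the OOM killer is suspected, the kernel log usually records the kill; a minimal sketch (Linux-only, message wording varies by kernel, and reading dmesg may require elevated privileges):

import subprocess

# Scan the kernel ring buffer for OOM-killer activity involving mongod.
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    lowered = line.lower()
    if "out of memory" in lowered or ("killed process" in lowered and "mongod" in lowered):
        print(line)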

Thanks,
Ramón.
