[SERVER-17047] MongoD aborts when running under memory pressure Created: 26/Jan/15  Updated: 03/Feb/15  Resolved: 26/Jan/15

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.0.0-rc6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eitan Klein Assignee: Unassigned
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File rs1.txt    
Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Configured Windows 8 in a Hyper-V environment and assigned a small amount of RAM for a quick repro.

Environment

  • 2-node replica set on Windows 8 with 2 GB of RAM and a 0.5 GB page file
    --replSet EitanRs3 --port 5001 --dbpath c:\data\db1 --wiredTigerCacheSizeGB 1 --storageEngine wiredTiger --logpath
  • Hammer.mongo, insert-only profile
    command line: -initdb=false -server ip:port -profile=INSERT -totaltime=52000 -worker=8 -rps=0
Participants:

 Description   

Dump file: \\eitan6\tmp\eitan8av.dmp
Access: net use eitan6 /u:eitan6\administrator Admin01
eitan6 IP: 10.4.109.190

The log file shows an invariant failure:

2015-01-25T08:07:53.158-0800 I -        [conn17] Invariant failure: ret resulted in status UnknownError -28992: Not enough storage is available to process this command.

If I understand correctly, the crash happened because WT only reported the error; however, the eviction thread ran before the invariant fired and prevented a graceful termination of mongod.

2015-01-25T08:07:53.129-0800 E STORAGE  [conn17] WiredTiger (-28992) [1422202073:27909][2072:140723412734768], file:collection-6-8102947714341443838.wt, cursor.next: memory allocation: Not enough storage is available to process this command.
 

000000c0`71ccf950 00007ffc`b9e16a16 ntdll!RtlReportCriticalFailure+0x8c
000000c0`71ccfa60 00007ffc`b9e17614 ntdll!RtlpHeapHandleError+0x12
000000c0`71ccfa90 00007ffc`b9dd2b4d ntdll!RtlpLogHeapFailure+0xa4
000000c0`71ccfac0 00007ff7`bc4a78f0 ntdll!RtlFreeHeap+0x6bfbd
000000c0`71ccfb60 00007ff7`bc3fc8b1 mongod!free+0x1c [f:\dd\vctools\crt\crtw32\heap\free.c @ 51]
000000c0`71ccfb90 00007ff7`bc3fc815 mongod!__free_skip_array+0x71 [c:\data\mci\shell\src\src\third_party\wiredtiger\src\btree\bt_discard.c @ 356]
000000c0`71ccfbe0 00007ff7`bc3fcec4 mongod!__free_page_row_leaf+0x95 [c:\data\mci\shell\src\src\third_party\wiredtiger\src\btree\bt_discard.c @ 336]
000000c0`71ccfc20 00007ff7`bc409950 mongod!__wt_page_out+0xd4 [c:\data\mci\shell\src\src\third_party\wiredtiger\src\btree\bt_discard.c @ 124]
000000c0`71ccfc50 00007ff7`bc43312a mongod!__wt_split_multi+0x1b0 [c:\data\mci\shell\src\src\third_party\wiredtiger\src\btree\bt_split.c @ 1535]
000000c0`71ccfce0 00007ff7`bc4336ef mongod!__evict_page_dirty_update+0x10a [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_page.c @ 215]
000000c0`71ccfd10 00007ff7`bc432d1e mongod!__wt_evict+0x1ff [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_page.c @ 110]
000000c0`71ccfd90 00007ff7`bc431f0a mongod!__wt_evict_lru_page+0xae [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 1310]
000000c0`71ccfdd0 00007ff7`bc431c0c mongod!__evict_server_work+0x6a [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 827]
000000c0`71ccfe00 00007ff7`bc431ccc mongod!__evict_pass+0x22c [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 495]
000000c0`71ccfe60 00007ffc`b90715bd mongod!__evict_server+0x3c [c:\data\mci\shell\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 164]
000000c0`71ccfe90 00007ffc`b9d943d1 kernel32!BaseThreadInitThunk+0xd
000000c0`71ccfec0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 



 Comments   
Comment by Mark Benvenuto [ 26/Jan/15 ]

I am closing this as "by design".

The times in the dump file (i.e., the stack in the bug) do not line up with the log file, so I cannot say anything about cause and effect. I believe it may be a double free or a heap overwrite, but it will likely not repro easily. You would need to trap it in the debugger and use the various !heap commands to inspect it.

The abort (rude shutdown) is expected in this case. When WT returns an unexpected status code (in general, any error), invariantWTOK fails and aborts the server. This means that when WT hits OOM, runs out of disk space, etc., we shut down, since we do not have special error-handling code for the arbitrary OS errors that WT may return.
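To make the "invariant on storage-engine status" behaviour concrete, here is a minimal C++ sketch. It assumes the WiredTiger-style convention that a call returns 0 on success and a non-zero code on any error. The names `storageOk`, `CHECK_STORAGE_OK`, `invariantFailure`, `fakeCursorNext` and `checkedCursorNext` are hypothetical; this is an illustration of the pattern, not MongoDB's actual `invariantWTOK` implementation. The point it shows is that every non-zero code, including an OS-level allocation failure such as -28992 from the log above, takes the same abort path because there is no per-error recovery branch.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Assumed convention (WiredTiger-style): 0 is success, anything else is an error.
inline bool storageOk(int ret) { return ret == 0; }

// Hypothetical failure handler: report the failing expression and status, then abort.
// There is deliberately no attempt to recover or shut down cleanly.
[[noreturn]] inline void invariantFailure(const char* expr, int ret,
                                          const char* file, int line) {
    std::fprintf(stderr, "Invariant failure: %s resulted in status %d at %s:%d\n",
                 expr, ret, file, line);
    std::fflush(stderr);
    std::abort();
}

// Hypothetical check macro: any non-zero status, whatever its cause, aborts.
#define CHECK_STORAGE_OK(expr)                                        \
    do {                                                              \
        const int checkRet_ = (expr);                                 \
        if (!storageOk(checkRet_))                                    \
            invariantFailure(#expr, checkRet_, __FILE__, __LINE__);   \
    } while (false)

// Hypothetical storage call that fails with the code seen in the log when
// memory allocation fails, and succeeds otherwise.
inline int fakeCursorNext(bool outOfMemory) { return outOfMemory ? -28992 : 0; }

// Runs the call through the check; only returns if the call succeeded.
inline int checkedCursorNext(bool outOfMemory) {
    int ret = fakeCursorNext(outOfMemory);
    CHECK_STORAGE_OK(ret);
    return ret;
}
```

Calling `checkedCursorNext(true)` would print a message of the same shape as the invariant line in the log and then abort the process, which corresponds to the rude shutdown described in the comment. Handling OOM gracefully would require a dedicated branch for that specific status before the check.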

Generated at Thu Feb 08 03:43:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.