[SERVER-19751] WiredTiger panic halt in eviction-server Created: 04/Aug/15  Updated: 24/Aug/15  Resolved: 12/Aug/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.5
Fix Version/s: 3.0.6, 3.1.7

Type: Bug Priority: Major - P3
Reporter: Michael Templeman Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: RF, WTmem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS Linux amzn-ami-hvm-2015.03.0.x86_64-gp2 (ami-1ecae776) on a i2.xlarge instance


Issue Links:
Depends
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: Linux
Backport Completed:
Participants:

 Description   

Primary server on primary shard running mongodb 3.0.5 encountered panic halt. Log report of the crash:

2015-08-04T12:21:33.770+0000 E STORAGE  WiredTiger (11) [1438690893:759387][2653:0x7f6a60957700], eviction-server: pthread_create: Resource temporarily unavailable
2015-08-04T12:21:33.771+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to xxx.xxx.xxx.xxx:27009; UnknownError Sending command { replSetHeartbeat: "xxxxx", pv: 1, v: 88440, from: "s
1r2.db.meshfire.com:27017", fromId: 35, checkEmpty: false } on database admin over network to xxx.xxx.xxx.xxx:27009 received exception boost::thread_resource_error
2015-08-04T12:21:34.171+0000 I NETWORK  [conn4522] end connection 10.169.110.83:58092 (178 connections now open)
2015-08-04T12:21:34.171+0000 I NETWORK  [conn3298] end connection 10.146.247.2:59139 (178 connections now open)
2015-08-04T12:21:34.171+0000 I NETWORK  [initandlisten] connection accepted from 10.146.247.2:33242 #4528 (181 connections now open)
2015-08-04T12:21:34.371+0000 E STORAGE  WiredTiger (11) [1438690893:983089][2653:0x7f6a60957700], eviction-server: cache eviction server error: Resource temporarily unavailable
2015-08-04T12:21:34.475+0000 I REPL     [ReplicationExecutor] replset: another primary seen with election time 55c0ae3f:1 my election time is 55c07850:1
2015-08-04T12:21:34.483+0000 I COMMAND  [conn4525] command admin.$cmd command: isMaster { isMaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:514 locks:{} 1221ms
2015-08-04T12:21:34.483+0000 I COMMAND  [conn4524] command admin.$cmd command: isMaster { isMaster: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:514 locks:{} 1221ms
2015-08-04T12:21:34.791+0000 I -        [conn3313] Fatal Assertion 28559
2015-08-04T12:21:34.795+0000 E STORAGE  WiredTiger (-31804) [1438690894:584032][2653:0x7f6a60957700], eviction-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-08-04T12:21:34.794+0000 I NETWORK  [initandlisten] connection accepted from 10.155.4.24:55882 #4529 (180 connections now open)
2015-08-04T12:21:34.993+0000 I -        [WiredTigerRecordStoreThread for local.oplog.rs] Fatal Assertion 28559
2015-08-04T12:21:35.078+0000 E STORAGE  WiredTiger (-31804) [1438690894:891356][2653:0x7f6a5e152700], session.checkpoint: metadata unroll update file:collection-426-1927279256388819562.wt to allocation_size=4KB,app_metadata=(formatVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=0,checkpoint=(WiredTigerCheckpoint.5871=(addr="019681e48761b57f9781e4a3ff13a59881e41e365d48808080e3317fc0e401820fc0",order=5871,time=1438690732,size=25309184,write_gen=2990956)),checkpoint_lsn=(6517,78431488),checksum=on,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=429,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=u,version=(major=1,minor=1): WT_PANIC: WiredTiger library panic

The backtrace:

----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77CA9"},{"b":"400000","o":"B16B61"},{"b":"400000","o":"AFB6E1"},{"b":"400000","o":"9A193A"},{"b":"400000","o":"FAA2F9"},{"b":"400000","o":"FAA4B5"},{"b":"400000","o":"F782F3"},{"b":"400000","o":"FB3811"},{"b":"400000","o":"FA6A86"},{"b":"400000","o":"F3F33A"},{"b":"7F6A6A082000","o":"7DF3"},{"b":"7F6A68A47000","o":"F61BD"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "Linux", "release" : "3.14.48-33.39.amzn1.x86_64", "version" : "#1 SMP Tue Jul 14 23:43:07 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E8A046532FE2C5A8474250BA8E5B72C6C84B9D68" }, { "b" : "7FFE218AA000", "elfType" : 3, "buildId" : "DA2E8BBD29FCD73624DD2C4211BC3845464603D1" }, { "b" : "7F6A6A082000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F6A69E15000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "22480480235F3B1C6C2E5E5953949728676D3796" }, { "b" : "7F6A69A30000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "F1C146B78505646930DD9003AA2B3622C5226D1B" }, { "b" : "7F6A69828000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F6A69624000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F6A69320000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "DD6383EEAC49E9BAA9E3D1080AE932F42CF8A385" }, { "b" : "7F6A6901E000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F6A68E08000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "C52958E393BDF8E8D090F36DE0F4E620D8736FBF" }, { "b" : "7F6A68A47000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "DF6DA145A649EA093507A635AF383F608E7CE3F2" }, { "b" : "7F6A6A29E000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F6A687FB000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "FF843C37C38E5BFFD57F7BCCAE05FDADC6390C8F" }, { "b" : "7F6A68518000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "0BB150CC29DB5B0E039879EFC00152A75E3B00B9" }, { "b" : "7F6A68315000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "622F315EB5CB2F791E9B64020692EBA98195D06D" }, { "b" : "7F6A680E3000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "1485992B0E5CDBA0A34817FC8C6A4C45E82CD1A9" }, { "b" : "7F6A67ECD000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7F6A67CBE000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "A75A81EC50E9E0164A12B59D9987AD61AC7576C8" }, { "b" : "7F6A67ABB000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7F6A678A1000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "6A7DA1CED90F65F27CB7B5BACDBB1C386C05F592" }, { "b" : "7F6A67680000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77ca9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf16b61]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xefb6e1]
 mongod(+0x9A193A) [0xda193a]
 mongod(__wt_eventv+0x489) [0x13aa2f9]
 mongod(__wt_err+0x95) [0x13aa4b5]
 mongod(__wt_meta_track_off+0x1D3) [0x13782f3]
 mongod(__wt_txn_checkpoint+0xAE1) [0x13b3811]
 mongod(+0xFA6A86) [0x13a6a86]
 mongod(+0xF3F33A) [0x133f33a]
 libpthread.so.0(+0x7DF3) [0x7f6a6a089df3]
 libc.so.6(clone+0x6D) [0x7f6a68b3d1bd]
-----  END BACKTRACE  -----

We have had repeated occurrences of the primary server cache going to 100% and remaining there until the instance is stepped down. We also experience frequent OOM crashes as the memory footprint of mongodb grows beyond available memory. This condition occurs despite setting the wired tiger cache size to 6GB less than available memory.



 Comments   
Comment by Githook User [ 06/Aug/15 ]

Author:

{u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith.bostic@mongodb.com'}

Message: SERVER-19751 Retry pthread_create on EAGAIN or EINTR.

Merge pull request #2107 from wiredtiger/pthread-create-retry
(cherry picked from commit b52d2d3a9cfdfa32ac2eacf3d3d5f281395d834f)
Branch: mongodb-3.0
https://github.com/wiredtiger/wiredtiger/commit/7cb51cd688c2375a504f99d5c57674e7fdfbcee0

Comment by Githook User [ 05/Aug/15 ]

Author:

{u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith.bostic@mongodb.com'}

Message: Merge pull request #2107 from wiredtiger/pthread-create-retry

SERVER-19751 Retry pthread_create on EAGAIN or EINTR.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/b52d2d3a9cfdfa32ac2eacf3d3d5f281395d834f

Comment by Githook User [ 05/Aug/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-19751 Retry pthread_create on EAGAIN or EINTR.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/4a322631ec244fe5cc204b560ab09cda60ada00a

Comment by Michael Templeman [ 04/Aug/15 ]

oops, I forgot to report that the ulimit was set to 64000, THP was turned off and tcp_keepalive_time is set to 120.

Comment by Michael Templeman [ 04/Aug/15 ]

Ramon

The i2.xlarge ec2 instance has 30GB of memory. Nothing else runs on the server than mongo. Mongod was started with:

mongod --dbpath xxx/xxxxx --storageEngine wiredTiger --wiredTigerCacheSizeGB 22 --logpath /xxxx/xxxxx/mongo.log --oplogSize 90000 --fork --replSet xxxxx

I don't know what the actual memory utilization was when the process halted. Past experience has been that under certain operations the process cache goes to 100% and the process memory utilization as reported by mongostat will rise to around 29GB (at which point a crash is imminent).

Comment by Ramon Fernandez Marina [ 04/Aug/15 ]

mike@meshfire.com, the error message:

2015-08-04T12:21:33.770+0000 E STORAGE  WiredTiger (11) [1438690893:759387][2653:0x7f6a60957700], eviction-server: pthread_create: Resource temporarily unavailable

indicates that the eviction server failed to create a new thread with EAGAIN. pthread_create(3) reads:

       EAGAIN Insufficient resources to create another thread, or a system-imposed limit on  the  number  of  threads  was
              encountered.   The  latter  case  may occur in two ways: the RLIMIT_NPROC soft resource limit (set via setr‐
              limit(2)), which limits the number of process for a real user ID, was reached; or the  kernel's  system-wide
              limit on the number of threads, /proc/sys/kernel/threads-max, was reached.

Can you please provide more details about the memory configuration and utilization on the affected node, as well as information about system limits for the mongod process?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 04/Aug/15 ]

Hi mike@meshfire.com, thanks for the report – we're investigating. I'd recommend the {noformat} macro for logs, I've edited the description above to use it.

Comment by Michael Templeman [ 04/Aug/15 ]

Darn. I screwed up the formatting again. For log dumps what is the proper format command?

Generated at Thu Feb 08 03:51:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.