[SERVER-19881] WiredTiger library panic after checkpoint server error Created: 12/Aug/15  Updated: 13/Jun/16  Resolved: 06/Oct/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: white_li Assignee: David Hows
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

log context and trace

2015-08-12T04:15:23.767Z I NETWORK  [initandlisten] connection accepted from 10.40.146.167:57233 #28796692 (64 connections now open)
2015-08-12T04:15:23.895Z I NETWORK  [conn28796689] end connection 10.40.146.166:52083 (63 connections now open)
2015-08-12T04:15:23.900Z I NETWORK  [initandlisten] connection accepted from 10.40.146.166:52290 #28796693 (64 connections now open)
2015-08-12T04:15:24.083Z I QUERY    [conn28796690] query ipattern.dex_string query: { sha256: "12293ccfdbe29d6db35fbf5b08e43a3599b36cf567404d686c39224f27243013" } planSummary: IXSCAN { sha256: 1.0 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:592054 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 512ms
2015-08-12T04:15:24.377Z I NETWORK  [conn28796691] end connection 10.40.146.168:57956 (63 connections now open)
2015-08-12T04:15:24.403Z I NETWORK  [initandlisten] connection accepted from 10.40.146.168:57969 #28796694 (64 connections now open)
2015-08-12T04:15:24.472Z I NETWORK  [conn28796690] end connection 10.40.146.161:44116 (63 connections now open)
2015-08-12T04:15:24.477Z I NETWORK  [initandlisten] connection accepted from 10.40.146.161:44886 #28796695 (64 connections now open)
2015-08-12T04:15:24.533Z I ACCESS   [conn28796694] Successfully authenticated as principal ipattern on ipattern
2015-08-12T04:15:24.565Z I NETWORK  [conn28796693] end connection 10.40.146.166:52290 (63 connections now open)
2015-08-12T04:15:24.568Z I NETWORK  [initandlisten] connection accepted from 10.40.146.166:52296 #28796696 (64 connections now open)
2015-08-12T04:15:24.803Z E STORAGE  WiredTiger (28) [1439352924:798511][26060:0x7fc24358d700], file:ipattern/collection/0--2932365438447056541.wt, session.checkpoint: ipattern/collection/0--2932365438447056541.wt write error: failed to write 221184 bytes at offset 489973813248: No space left on device
2015-08-12T04:15:24.803Z E STORAGE  WiredTiger (28) [1439352924:803604][26060:0x7fc24358d700], checkpoint-server: checkpoint server error: No space left on device
2015-08-12T04:15:24.803Z E STORAGE  WiredTiger (-31804) [1439352924:803620][26060:0x7fc24358d700], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-08-12T04:15:24.810Z I NETWORK  [conn28657853] end connection 127.0.0.1:24407 (63 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657854] end connection 127.0.0.1:24408 (62 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657857] end connection 127.0.0.1:24410 (61 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657858] end connection 127.0.0.1:24411 (61 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657859] end connection 127.0.0.1:24412 (60 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657861] end connection 127.0.0.1:24414 (60 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657860] end connection 127.0.0.1:24413 (60 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657856] end connection 127.0.0.1:24409 (60 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657863] end connection 127.0.0.1:24416 (59 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657864] end connection 127.0.0.1:24417 (59 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657865] end connection 127.0.0.1:24418 (59 connections now open)
2015-08-12T04:15:24.811Z I NETWORK  [conn28657862] end connection 127.0.0.1:24415 (58 connections now open)
2015-08-12T04:15:24.815Z I -        [rsSync] Fatal Assertion 28559
2015-08-12T04:15:24.815Z I -        Fatal Assertion 28558
2015-08-12T04:15:25.005Z I CONTROL  
 0xf77369 0xf140f1 0xef7e41 0xd95afa 0x13ab029 0x13ab1e5 0x13ab684 0x13405f3 0x7fc24f9719d1 0x7fc24e4c38fd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77369"},{"b":"400000","o":"B140F1"},{"b":"400000","o":"AF7E41"},{"b":"400000","o":"995AFA"},{"b":"400000","o":"FAB029"},{"b":"400000","o":"FAB1E5"},{"b":"400000","o":"FAB684"},{"b":"400000","o":"F405F3"},{"b":"7FC24F96A000","o":"79D1"},{"b":"7FC24E3DB000","o":"E88FD"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.el6.x86_64", "version" : "#1 SMP Fri Nov 22 03:15:09 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7702B4740E91E1BA6F701DB2A42553968AC09562" }, { "b" : "7FFF580FF000", "elfType" : 3, "buildId" : "81A81BE2E44C93640ADEDB62ADC93A47F4A09DD1" }, { "b" : "7FC24F96A000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B8DFF8E53D9F2B80C3C382E83EC17C828B536A39" }, { "b" : "7FC24F6FE000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "60304AE7ACD2AD985C6D32C363433C66D96D5992" }, { "b" : "7FC24F31B000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "16120BB9F71E157E33C947AEB2DB0AB550892EFE" }, { "b" : "7FC24F113000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "583411D8786F86A1D6B8741C502831E6122445A7" }, { "b" : "7FC24EF0F000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "454F8FC6CC6502C6401E5F9E221564D80665D277" }, { "b" : "7FC24EC09000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "ED99110E629209C5CA6C0ED704F2C5CE3171513A" }, { "b" : "7FC24E985000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C9A87F6A29ED1D3CB18F539845A45FE3A9877FF1" }, { "b" : "7FC24E76F000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "b" : "7FC24E3DB000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "1425CB3B4C2F49C8101ED9B8F1D289053B4DFA77" }, { "b" : "7FC24FB87000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F8E59B70E469F3A924A268911FF8FD0C37E7460" }, { "b" : "7FC24E197000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "54BA6B78A9220344E77463947215E42F0EABCC62" }, { "b" : "7FC24DEB1000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6797403AA5F8FAD8ADFF683478B45F528CE4FB0E" }, { "b" : "7FC24DCAD000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "b" : "7FC24DA81000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "b" : "7FC24D86B000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7FC24D660000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E3FA235F3BA3F776A01A18ECA737C9890F445923" }, { "b" : "7FC24D45D000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7FC24D243000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "58B696478044E028A5970D48A4ED50E164B43B36" }, { "b" : "7FC24D024000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77369]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf140f1]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef7e41]
 mongod(+0x995AFA) [0xd95afa]
 mongod(__wt_eventv+0x489) [0x13ab029]
 mongod(__wt_err+0x95) [0x13ab1e5]
 mongod(__wt_panic+0x24) [0x13ab684]
 mongod(+0xF405F3) [0x13405f3]
 libpthread.so.0(+0x79D1) [0x7fc24f9719d1]
 libc.so.6(clone+0x6D) [0x7fc24e4c38fd]
-----  END BACKTRACE  -----
2015-08-12T04:15:25.019Z I -        
 
***aborting after fassert() failure



 Comments   
Comment by Ramon Fernandez Marina [ 13/Jun/16 ]

Thanks for the update DrIanSharp@gmail.com, and glad to hear the documentation helped with this matter.

Comment by Ian Sharp [X] [ 12/Jun/16 ]

getting this error

checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
Out of file descriptors. Waiting one second before trying to accept more connections.
Fatal Assertion 28558

Following directions at: https://docs.mongodb.com/ecosystem/platforms/amazon-ec2/

solved it for me.

mongod -version

db version v3.2.5
git version: 34e65e5383f7ea1726332cb175b73077ec4a1b02
OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010
allocator: tcmalloc
modules: none
build environment:
distmod: amazon
distarch: x86_64
target_arch: x86_64

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 2.7G 5.0G 35% /
devtmpfs 491M 56K 491M 1% /dev
tmpfs 499M 0 499M 0% /dev/shm

i start my mongod in a replicaset with

sudo mongod --fork --port 27017 --replSet go --logpath ~/log/mongo.log

Comment by David Hows [ 06/Oct/15 ]

Marking as "Cannot Reproduce"

Comment by David Hows [ 28/Sep/15 ]

Hi swordsmanli,

Sorry for the long time between responses here.

Assuming your system has not been restarted, can you please post the full dmesg log for this system, not just the grep for "error". Not all system error messages contain this word and we will need to review the whole system log.

If the system has been rebooted, have you seen this error subsequently? If not, it may not be possible to gather the necessary information required to find root cause here.

Thanks,
David

Comment by white_li [ 28/Aug/15 ]

hi Ramon

ulimit limits

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1031967
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 64000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 32000
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

only error info in dmesg
ERST: Error Record Serialization Table (ERST) support is initialized.
ACPI Error: No handler for Region [SYSI] (ffff882029e57348) [IPMI] (20090903/evregion-319)
ACPI Error: Region IPMI(7) has no handler (20090903/exfldio-295)
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._GHL] (Node ffff882029e549e8), AE_NOT_EXIST
ACPI Error (psparse-0537): Method parse/execution failed [\_SB_.PMI0._PMC] (Node ffff882029e54998), AE_NOT_EXIST

Comment by Ramon Fernandez Marina [ 25/Aug/15 ]

swordsmanli, can you post the output of ulimit -a on this system? Also, are there any quotas? Finally, can you check your system logs for storage errors? These are three possible culprits for this behavior at this stage.

Thanks,
Ramón.

Comment by white_li [ 12/Aug/15 ]

hello Michael

disk is not full
/dev/sdb1 28T 5.5T 22T 21% /data

wt file stats
total 3.8T
rw-rr- 1 mongod mongod 4.0K Aug 12 04:38 8--1546084566581966712.wt
rw-rr- 1 mongod mongod 3.4T Aug 12 09:04 2--2932365438447056541.wt
rw-rr- 1 mongod mongod 457G Aug 12 09:05 0--2932365438447056541.wt

it seems i can not write to 0--2932365438447056541.wt
but when i restart mongod ,it still writes to file 0--2932365438447056541.wt normally
so what is this situation or some other issue , can you help to explain it?

Comment by Michael Cahill (Inactive) [ 12/Aug/15 ]

swordsmanli, that error indicates that writing to a file has failed. In this case, the write apparently failed because the disk was full (one collection grew to almost 500GB, I can't tell how much data is in other collections).

What is your request: an explanation, a clearer error message, more graceful handling of out-of-space errors, or something else?

Generated at Thu Feb 08 03:52:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.