[SERVER-26533] WiredTiger library panic Created: 08/Oct/16  Updated: 13/Aug/18  Resolved: 13/Oct/16

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shuo Wang Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: envm, rge, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb_log.tgz    
Operating System: Linux
Participants:

 Description   

A mongodb instance was unexpectedly closed in a replica set. I have to remove all data and start it in order to recover service as soon as possible. The log information as follow:

2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] MongoDB starting : pid=16224 port=27017 dbpath=/data/mongodata/data 64-bit host=mongo01.log
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] db version v3.2.1
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] modules: none
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] build environment:
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten]     distmod: ubuntu1404
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten]     distarch: x86_64
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten]     target_arch: x86_64
2016-10-08T11:44:54.060+0800 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0", port: 27017 }, replication: { replSet: "foba" }, security: { keyFile: "/etc/mongodb-keyfile" }, storage: { dbPath:
"/data/mongodata/data", journal: { enabled: false } }, systemLog: { destination: "file", logAppend: true, path: "/data/mongodata/log/mongo.log" } }
2016-10-08T11:44:54.092+0800 I -        [initandlisten] Detected data files in /data/mongodata/data created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-10-08T11:44:54.092+0800 W -        [initandlisten] Detected unclean shutdown - /data/mongodata/data/mongod.lock is not empty.
2016-10-08T11:44:54.092+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2016-10-08T11:44:54.092+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=2G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=s
nappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),,log=(enabled=false),
2016-10-08T11:44:54.951+0800 E STORAGE  [initandlisten] WiredTiger (22) [1475898294:951592][16224:0x7f94d0c5ad00], file:collection-155-6426631565121784140.wt, WT_SESSION.open_cursor: live.avail: merge range 28672-40960 overlaps with exi
sting range 28672-32768: Invalid argument
2016-10-08T11:44:54.951+0800 E STORAGE  [initandlisten] WiredTiger (-31804) [1475898294:951642][16224:0x7f94d0c5ad00], file:collection-155-6426631565121784140.wt, WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: Wire
dTiger library panic
2016-10-08T11:44:54.951+0800 I -        [initandlisten] Fatal Assertion 28558
2016-10-08T11:44:54.951+0800 I -        [initandlisten]
2016-10-08T11:44:54.972+0800 F -        [initandlisten] Got signal: 6 (Aborted).

BTW: The version is v3.2.1.



 Comments   
Comment by Kelsey Schubert [ 23/Jan/17 ]

Hi shuowang,

Unfortunately, as Keith explained, there is no simple process to get the data from a particular block.

Kind regards,
Thomas

Comment by Keith Bostic (Inactive) [ 16/Jan/17 ]

Is there any way to get data from 8192B block at offset 1312190464? I want to know which document corruption in the MongoDB's collection.

I don't think there's any simple solution (although anonymous.user may have a better idea than I do).

In short, documents are encoded BSON, so there may or may not be useful text there, second, documents are generally compressed using snappy, and finally, if the block is truly corrupted, it may not be possible to decompress it.

If the backing collection isn't compressed, I would suggest using a tool like hexdump to display the contents of the block.

If the backing collection is compressed, it's going to be harder to get the information. If you want to use dd to copy out that 8192B block from the file and upload it into this ticket, I could try and crack it and let you know if I see anything useful.

Comment by Shuo Wang [ 16/Jan/17 ]

Hi,

Is there any way to get data from 8192B block at offset 1312190464? I want to know which document corruption in the MongoDB's collection.

read checksum error for 8192B block at offset 1312190464: calculated block checksum of 2510031984 doesn't match expected checksum of 193589556

Thanks

Comment by Alexander Gorrod [ 16/Oct/16 ]

shuowang The error leading to MongoDB shutting down was:

2016-10-15T01:50:51.168+0800 E STORAGE  [conn29728] WiredTiger (0) [1476467451:161860][16707:0x7f0cf8935700], file:collection-681-8812398166181733878.wt, WT_CURSOR.next: read checksum error for 8192B block at offset 1312190464: calculated block checksum of 2510031984 doesn't match expected checksum of 193589556
2016-10-15T01:50:51.168+0800 E STORAGE  [conn29728] WiredTiger (0) [1476467451:168136][16707:0x7f0cf8935700], file:collection-681-8812398166181733878.wt, WT_CURSOR.next: collection-681-8812398166181733878.wt: encountered an illegal file format or internal value
2016-10-15T01:50:51.168+0800 E STORAGE  [conn29728] WiredTiger (-31804) [1476467451:168152][16707:0x7f0cf8935700], file:collection-681-8812398166181733878.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic

That error means WiredTiger (the storage engine for MongoDB) detected that there was a disk corruption affecting the database. Is it possible that your underlying storage system is unreliable? I recommend running integrity checks on your hardware to look for problems.

When MongoDB detects an on-disk data corruption, it exits immediately, since attempting to continue could lead to more extensive data corruption and/or data loss. The behavior you report is expected in this situation.

Comment by Shuo Wang [ 16/Oct/16 ]

There was a mongodb instance crashed again at yesterday. But with the previous situation is different from the crashed instance can be restarted normally at this time. I've archived the diagnostic.data and mongodb.log. Please see the attachment.

Thanks.

Comment by Kelsey Schubert [ 13/Oct/16 ]

Hi shuowang,

Since there isn't anything more we can do to progress our investigation at this time, I am going to close this ticket. Please let us know if you encounter this issue again, and we will reopen this ticket and continue to investigate.

Kind regards,
Thomas

Comment by Shuo Wang [ 12/Oct/16 ]

Thanks. It's a very helpful suggestion.

Comment by Kelsey Schubert [ 12/Oct/16 ]

Hi shuowang,

Unfortunately, from the information currently available, we cannot identify the root cause of this issue. Would you please upgrade to MongoDB 3.2.10 (the latest release) and make sure you’re configured to track error messages on failure and capture diagnostic information?

Please also confirm your filesystem isn’t running out of space or doesn’t otherwise fail to meet the requirements of MongoDB.

If this error happens again, we will need an archive of the diagnostic.data, which would cover the period leading up to the failure, and confirmation no system errors occurred. In addition, I would recommend copying the the $dbpath before resyncing if possible as an examination of the data files may help progress the investigation.

Thank you,
Thomas

Comment by Shuo Wang [ 09/Oct/16 ]

"A mongodb instance was unexpectedly closed" means that the process suddenly crashed by itself. After that, I tried to start it, but it didn't work.
There wasn't any error informations in the system logs, such as /var/log/messages, and so on. I only found the mongodb log recorded some error informations.
The mongodb runs in VM, you can understand the VM is AWS EC2. And the storage device is similar to AWS EBS.

Comment by Keith Bostic (Inactive) [ 08/Oct/16 ]

Thank you for the report. We are investigating this issue, however, please be aware that determining the root cause of data corruption without a reproduction is challenging.

Can you describe further what you mean by "A mongodb instance was unexpectedly closed"?

Can you please tell us if there any errors recorded in the system logs, either before, or around the time of the failure?

What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?

Comment by Shuo Wang [ 08/Oct/16 ]

There was another mongodb unexpectedly closed in a few days ago. The log information as follow:

2016-10-04T02:44:34.634+0800 E STORAGE [repl writer worker 16] WiredTiger (0) [1475520274:634770][8619:0x7f3d4b742700], file:collection-310--1292709067801170191.wt, WT_CURSOR.search: read checksum error for 12288B block at offset 139472896: calculated block checksum of 2825332407 doesn't match expected checksum of 2211965436
2016-10-04T02:44:34.634+0800 E STORAGE [repl writer worker 16] WiredTiger (0) [1475520274:634823][8619:0x7f3d4b742700], file:collection-310--1292709067801170191.wt, WT_CURSOR.search: collection-310--1292709067801170191.wt: encountered an illegal file format or internal value
2016-10-04T02:44:34.634+0800 E STORAGE [repl writer worker 16] WiredTiger (-31804) [1475520274:634836][8619:0x7f3d4b742700], file:collection-310--1292709067801170191.wt, WT_CURSOR.search: the process must exit and restart: WT_PANIC: WiredTiger library panic

Comment by Shuo Wang [ 08/Oct/16 ]

How can I modify the description? It's look like weird with unrecognized character.

Generated at Thu Feb 08 04:12:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.