[SERVER-31097] Two shards in cluster getting WT LIBRARY PANIC creating a simple index and every index retry crashes again Created: 14/Sep/17  Updated: 24/Sep/17  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Lucas Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb.log.2017-09-14T19-26-15     File mongodb.log.2017-09-14T19-26-35     File new_database_corruption.7z    
Operating System: ALL
Participants:

 Description   

While building a background index in our cluster of 7 shards (600 million documents), on a collection sharded by a hashed key, the server crashes continuously.

We created this index:

2017-09-14T20:42:29.543+0000 I INDEX    [initandlisten] found 1 interrupted index build(s) on shipyard.investigation_cards
2017-09-14T20:42:29.543+0000 I INDEX    [initandlisten] note: restart the server with --noIndexBuildRetry to skip index rebuilds
2017-09-14T20:42:29.545+0000 I INDEX    [initandlisten] build index on: shipyard.investigation_cards properties: { v: 2, key: { account_id: 1, universe_id: 1, stilingue_array.call_id: 1, stilingue_array.page_id: 1, normalized_posted_at: 1 }, name: "sac_call_id", ns: "shipyard.investigation_cards", background: true }
2017-09-14T20:42:29.545+0000 I INDEX    [initandlisten] 	 building index using bulk method; build may temporarily use up to 500 megabytes of RAM
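
For reference, the createIndex call was roughly the following (a reconstruction from the index spec in the startup log above; the exact invocation we used may have differed slightly):

db.investigation_cards.createIndex(
    {
        account_id: 1,
        universe_id: 1,
        "stilingue_array.call_id": 1,
        "stilingue_array.page_id": 1,
        normalized_posted_at: 1
    },
    { name: "sac_call_id", background: true }   // background build, as shown in the log
)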

After building for some time, MongoDB crashed with this error:

2017-09-14T20:43:58.517+0000 E STORAGE  [initandlisten] WiredTiger error (0) [1505421838:517507][852475:0x7f9e3c4b2d40], file:collection-22-3497018620930100997.wt, WT_CURSOR.next: read checksum error for 8192B block at offset 72198791168: block header checksum of 0 doesn't match expected checksum of 707510254
2017-09-14T20:43:58.517+0000 E STORAGE  [initandlisten] WiredTiger error (0) [1505421838:517551][852475:0x7f9e3c4b2d40], file:collection-22-3497018620930100997.wt, WT_CURSOR.next: collection-22-3497018620930100997.wt: encountered an illegal file format or internal value
2017-09-14T20:43:58.517+0000 E STORAGE  [initandlisten] WiredTiger error (-31804) [1505421838:517558][852475:0x7f9e3c4b2d40], file:collection-22-3497018620930100997.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-09-14T20:43:58.517+0000 I -        [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2017-09-14T20:43:58.517+0000 I -        [initandlisten] 

I will attach two log files: the first is from the first crash (right after the index build started) and the second is from a subsequent crash.
If you need more data I will need a secure portal to upload it, because the files are large. Unfortunately I can't upload any data files from this collection for security reasons.

When I started the server with the --noIndexBuildRetry option, the crashes stopped. I will run an initial sync on those two servers because I'm not confident this did not corrupt any data or indexes in my database.
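
For anyone hitting the same situation, this is roughly how I restarted the affected shard to skip the interrupted rebuild (the config file path is a placeholder for our real one):

mongod --config /etc/mongod.conf --noIndexBuildRetry

I believe the equivalent setting in the 3.4 YAML config file is storage.indexBuildRetry, though we used the command-line flag:

storage:
  indexBuildRetry: false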



 Comments   
Comment by Lucas [ 24/Sep/17 ]

Hello pasette, thanks for your comment.

Unfortunately I no longer have access to those shards' files, because they have been replaced with new ones and an initial sync has already been performed.

But I can tell you two things:

1. I have hit this problem before (SERVER-23532), a long time ago.

2. Another replica set (without ANY connection to these shards) crashed today with the same WT library panic error. I know it could be a different problem, but it is really strange. I will attach the log files, the metrics files, and all the WiredTiger files for that crash, and I really hope it can be diagnosed.

It is very worrying that a database has so many chances of corrupting itself.

Thanks.

Comment by Daniel Pasette (Inactive) [ 15/Sep/17 ]

Hi Lucas,
Please attach the log files for the second shard server that crashed and we can check whether there might be some relation between the crashes. However, the log files for the one shard you did include show that the server crashes at the exact same point in the index build with the same checksum error. That is quite strong evidence of an underlying storage corruption issue outside of the WT storage engine. What I think Ramon means by "checking the integrity of the storage layer" is to check syslog for any storage sub-system errors on that machine.
Thanks,
Dan

Comment by Lucas [ 15/Sep/17 ]

Don't you find it strange that, as I said in the description, TWO different shards crashed trying to create the index? Two different dedicated servers crashing at the same time while we were creating the index? Those MongoDB servers have been running for more than a month and I already indexed this same collection weeks ago.

And what do you mean by checking the integrity of the storage layer? Those shards are quite new to be getting corrupted storage like this, without any interruption or anything like that. How can this data corruption happen?

And I know what the SERVER project is for, but unfortunately I can't agree that there is nothing weird here and that this isn't something like a bug.

Thanks.

Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

These error messages indicate that the data on disk is corrupt. Even if you were able to upload the data I don't think we would be able to reconstruct it, so I'd recommend the following:

  • Check the integrity of the storage layer used by the affected nodes (a sketch of typical checks follows this list).
  • Once the storage layer is confirmed healthy, resync the affected nodes from healthy primaries.
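
As a rough illustration of the first point (assuming Linux hosts; device names and log paths are placeholders for the real ones), checks along these lines should surface storage sub-system errors:

dmesg -T | grep -iE 'i/o error|ata|fault'              # kernel messages about the block layer
grep -iE 'i/o error|xfs|ext4' /var/log/syslog          # or /var/log/messages on RHEL-style systems
smartctl -H /dev/sda                                   # SMART health summary (smartmontools)
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrectable'
fsck -n /dev/sda1                                      # read-only filesystem check on an unmounted volume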

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this, which involves more discussion, would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.
