[SERVER-31832] Getting error while taking backup or selecting a record in a collection Created: 04/Nov/17  Updated: 14/Aug/18  Resolved: 14/Nov/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sumit Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: envh, rns, rpu, trcf, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File Nov-09-2017.rar     HTML File WiredTiger     File WiredTiger.lock     File WiredTiger.turtle     File WiredTiger.wt     File WiredTigerLAS.wt     File _mdb_catalog.wt     File repair-SERVER-31832-2.tar.gz     File repair-SERVER-31832.tar.gz     File sizeStorer.wt     File storage.bson    
Operating System: ALL
Participants:

 Comments   
Comment by Mark Agarunov [ 14/Nov/17 ]

Hello sumit.jain@iongroup.com,

I'm sorry to hear that the repair didn't fix the issue. Unfortunately this indicates that there is irreparable corruption on the disk, so the only course of action would be to resync the affected node or restore from a backup if possible.
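For readers hitting the same corruption, a resync of a replica-set member typically looks roughly like the following. This is a minimal sketch, not an official procedure: the service name `mongod`, the dbpath `/var/lib/mongodb`, and the `mongodb` user are assumptions to adjust for your deployment, and it only applies if the node is a replica-set member with at least one healthy member holding a full copy of the data.

```shell
# On the affected node ONLY -- assumes a replica-set member and a
# healthy primary/secondary to sync from.

# 1. Stop the mongod process (service name is an assumption).
sudo systemctl stop mongod

# 2. Move the corrupted data files aside rather than deleting them,
#    in case they are needed for further diagnosis.
mv /var/lib/mongodb /var/lib/mongodb.corrupt
mkdir -p /var/lib/mongodb
chown mongodb:mongodb /var/lib/mongodb

# 3. Restart mongod; with an empty dbpath the node performs an
#    automatic initial sync from the other replica-set members.
sudo systemctl start mongod
```

For a standalone node with no replica set to sync from, the only option is restoring the most recent backup (e.g. with mongorestore).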

Thanks,
Mark

Comment by Sumit [ 10/Nov/17 ]

Hi Mark,

The repair files didn't resolve the issue. Not sure if there are any other options we can try.

Thanks
Sumit

Comment by Sumit [ 09/Nov/17 ]

Thanks Mark. We are working on it. I will let you know the status soon.
Do you know if this issue is common in MongoDB and has been faced by many other users? I am worried that if it happens frequently in Production, we will not be able to use MongoDB.

Comment by Mark Agarunov [ 09/Nov/17 ]

Hello sumit.jain@iongroup.com,

I've attached a repair attempt of the files you've provided as repair-SERVER-31832-2.tar.gz. Please extract these files and replace them in your $dbpath and let us know if it resolves the issue.
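For anyone following along, the "extract and replace in your $dbpath" step looks roughly like this. A sketch only: the dbpath `/var/lib/mongodb` and the `mongodb` user/group are assumptions, and mongod must be fully stopped before any files are touched.

```shell
# Stop mongod before touching any files in the dbpath.
sudo systemctl stop mongod

# Keep a copy of the current (corrupted) files before overwriting them.
cp -a /var/lib/mongodb /var/lib/mongodb.bak

# Extract the repaired files and copy them over the originals.
mkdir -p /tmp/repair
tar -xzf repair-SERVER-31832-2.tar.gz -C /tmp/repair
cp -a /tmp/repair/. /var/lib/mongodb/
chown -R mongodb:mongodb /var/lib/mongodb

sudo systemctl start mongod
```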

Thanks,
Mark

Comment by Sumit [ 09/Nov/17 ]

Hi Mark,

Thanks for getting back to us. I have attached another set of WT files; right now our Mongo databases are offline. It would be great if you could send another set of repair files, and we will try to restart the MongoDB service. The attached rar file is Nov-09-2017.rar.
We are worried that these issues can happen in production; if we are not able to recover the databases, it will be a big issue for us.

Thanks
Sumit

Comment by Mark Agarunov [ 07/Nov/17 ]

Hello sumit.jain@iongroup.com,

Thank you for the response. Unfortunately, this error indicates that there was corruption on the disk. In this situation, my best recommendation would be to resync the affected node or restore from a backup if possible.

Thanks,
Mark

Comment by Sumit [ 07/Nov/17 ]

Thanks Mark.

Even after copying the files, we still got the error below. Can you please look into this as a priority? It's in our Production environment.

2017-11-06T16:23:29.889-0500 E STORAGE [conn10] WiredTiger error (0) [1510003409:889971][13956:2008429440], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: read checksum error for 901120B block at offset 163971072: calculated block checksum of 3779091760 doesn't match expected checksum of 1400883067
2017-11-06T16:23:29.889-0500 E STORAGE [conn10] WiredTiger error (0) [1510003409:889971][13956:2008429440], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: Prod01/collection-0--7662079245466072202.wt: encountered an illegal file format or internal value
2017-11-06T16:23:29.889-0500 E STORAGE [conn10] WiredTiger error (-31804) [1510003409:889971][13956:2008429440], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-11-06T16:23:29.889-0500 I - [conn10] Fatal Assertion 28558 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 361
2017-11-06T16:23:29.889-0500 I - [conn10]

***aborting after fassert() failure

2017-11-06T16:23:29.891-0500 I - [conn9] Fatal Assertion 28559 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 64
2017-11-06T16:23:29.891-0500 I - [conn9]

-----------------------------
-----------------------------

  • What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
    [Sumit]: We use pure array disks, which are on a SAN server.
  • Would you please check the integrity of your disks?
    [Sumit]: Our netops team already checked that.
  • Has the database always been running this version of MongoDB? If not please describe the upgrade/downgrade cycles the database has been through.
    [Sumit]: Yes, it is 3.4. We recently installed this version in production for the first time.
  • Have you manipulated (copied or moved) the underlying database files? If so, was mongod running?
    [Sumit]: No. We do take disk snapshots, but that should not impact MongoDB.
  • Have you ever restored this instance from backups?
    [Sumit]: Yes, we did.
  • What method do you use to create backups?
    [Sumit]: We use MongoDump.
  • When was the underlying filesystem last checked and is it currently marked clean?
    [Sumit]: It's a new installation in production we did last month. I believe yes.
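Since mongodump is the backup method in use here, a typical dump/restore cycle looks roughly like the following. This is a sketch under assumptions: the host/port, the database name Prod01 (taken from the log file paths above), and the output directory are illustrative, not taken from the reporter's setup.

```shell
# Take a logical backup of the Prod01 database (connection details assumed).
mongodump --host localhost --port 27017 --db Prod01 --out /backups/$(date +%F)

# Restore it later into a running mongod; --drop replaces existing collections.
mongorestore --host localhost --port 27017 --db Prod01 --drop /backups/2017-11-09/Prod01
```

Note that mongodump takes a logical (document-level) backup, so restoring it produces fresh WiredTiger files; it cannot carry over on-disk corruption, which is why restoring from such a backup is a viable recovery path here.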
Comment by Mark Agarunov [ 06/Nov/17 ]

Hello sumit.jain@iongroup.com,

Thank you for the report. I've attached a repair attempt of the files you've provided. Would you please extract these files and replace them in your $dbpath and let us know if it resolves the issue? If you are still seeing errors after replacing these files, please provide the complete logs from mongod so that we can further investigate. Additionally, if this issue persists, please provide the following information:

  1. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
  2. Would you please check the integrity of your disks?
  3. Has the database always been running this version of MongoDB? If not please describe the upgrade/downgrade cycles the database has been through.
  4. Have you manipulated (copied or moved) the underlying database files? If so, was mongod running?
  5. Have you ever restored this instance from backups?
  6. What method do you use to create backups?
  7. When was the underlying filesystem last checked and is it currently marked clean?

Thanks,
Mark

Comment by Sumit [ 04/Nov/17 ]

Please help this is in Production environment.

This is what I can see in the log file, attached are the WiredTiger files:

2017-11-03T22:21:37.846-0400 E STORAGE [conn71] WiredTiger error (0) [1509762097:846341][4664:140733117239424], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: read checksum error for 901120B block at offset 163971072: calculated block checksum of 233622082 doesn't match expected checksum of 1400883067
2017-11-03T22:21:37.847-0400 E STORAGE [conn71] WiredTiger error (0) [1509762097:846341][4664:140733117239424], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: Prod01/collection-0--7662079245466072202.wt: encountered an illegal file format or internal value
2017-11-03T22:21:37.847-0400 E STORAGE [conn71] WiredTiger error (-31804) [1509762097:846341][4664:140733117239424], file:Prod01/collection-0--7662079245466072202.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-11-03T22:21:37.847-0400 I - [conn71] Fatal Assertion 28558 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 361
2017-11-03T22:21:37.847-0400 I - [conn71]

***aborting after fassert() failure

2017-11-03T22:21:37.878-0400 I - [WTJournalFlusher] Fatal Assertion 28559 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 64
2017-11-03T22:21:37.878-0400 I - [WTJournalFlusher]

Generated at Thu Feb 08 04:28:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.