[SERVER-48389] WT read checksum error, what could be the cause? Created: 23/May/20  Updated: 27/May/20  Resolved: 27/May/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Carl Champain (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hello!
I'm running MongoDB 4.2.6 in a PSS replica set.

On May 20 at 14:33 (UTC+2) my primary crashed suddenly (no particular job or load at the time) with this error:

2020-05-20T14:33:46.772+0200 E  STORAGE  [conn142228] WiredTiger error (0) [1589978026:771891][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 274: collection-48--3355261872768217798.wt: read checksum error for 114688B block at offset 3694592000: calculated block checksum  doesn't match expected checksum Raw: [1589978026:771891][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 274: collection-48--3355261872768217798.wt: read checksum error for 114688B block at offset 3694592000: calculated block checksum  doesn't match expected checksum
...
(and then a dump of some memory)
...
2020-05-20T14:33:46.786+0200 E  STORAGE  [conn142228] WiredTiger error (-31802) [1589978026:786272][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 292: collection-48--3355261872768217798.wt: fatal read error: WT_ERROR: non-specific WiredTiger error Raw: [1589978026:786272][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 292: collection-48--3355261872768217798.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
2020-05-20T14:33:46.786+0200 E  STORAGE  [conn142228] WiredTiger error (-31804) [1589978026:786278][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1589978026:786278][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic
2020-05-20T14:33:46.786+0200 F  -        [conn142228] Fatal Assertion 50853 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 414
2020-05-20T14:33:46.786+0200 F  -        [conn142228] 
2020-05-20T14:33:46.822+0200 F  -        [conn142228] Got signal: 6 (Aborted).

Fortunately the fail-over went well: one of the secondaries became primary. I had a look at this error, which looks like a corrupted data read (not sure whether it came from disk or memory?). The server is fairly new, and I ran some tests on the disk and memory but didn't find any problem. Can you provide a bit more detail about what could cause this error? Is there any way I could determine whether it's hardware or software? Is there anything to investigate in the MongoDB code itself?
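
(Roughly the kind of checks I mean; this is a sketch only, with a placeholder device name, memory size and grep patterns rather than the exact commands and output from this host:)

sudo smartctl -H -A /dev/sda                            # SMART health status and error/reallocated-sector counters
sudo dmesg -T | grep -iE 'i/o error|ata|nvme|mce|ecc'   # kernel messages around the crash hinting at disk or RAM errors
sudo memtester 1024M 1                                  # quick userspace RAM test (an offline memtest86+ pass is more thorough)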

After that (May 20 at 22:21 UTC+2) I tried restarting mongod (without clearing the data dir) to see whether it could recover by itself, and it ended up in a weird state: mongod seemed to start normally according to the logs, and I didn't see anything suspicious related to corrupted data (maybe you'll find some). The log says the server went into rollback state and then secondary, but then talks about rollback again; here is an extract:

2020-05-20T22:21:33.748+0200 I  REPL     [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2020-05-20T22:21:33.748+0200 I  REPL     [rsBackgroundSync] Rollback using 'recoverToStableTimestamp' method.
2020-05-20T22:21:33.748+0200 I  REPL     [rsBackgroundSync] Scheduling rollback (sync source: db2.updn.io:27017)
2020-05-20T22:21:33.748+0200 I  ROLLBACK [rsBackgroundSync] transition to ROLLBACK
2020-05-20T22:21:33.748+0200 I  REPL     [rsBackgroundSync] State transition ops metrics: { lastStateTransition: "rollback", userOpsKilled: 0, userOpsRunning: 30 }
2020-05-20T22:21:33.748+0200 I  REPL     [rsBackgroundSync] transition to ROLLBACK from SECONDARY
...
2020-05-20T22:21:33.748+0200 I  ROLLBACK [rsBackgroundSync] Waiting for all background operations to complete before starting rollback
...
2020-05-20T22:21:33.749+0200 I  ROLLBACK [rsBackgroundSync] Finished waiting for background operations to complete before rollback
2020-05-20T22:21:33.749+0200 I  ROLLBACK [rsBackgroundSync] finding common point

But at this point it was not catching up or doing anything visible (no CPU or disk activity), and I could NOT connect to it: when starting the mongo shell it would initiate the connection but hang before displaying the prompt (no error). In the logs I only see a bunch of lines like these (probably from your monitoring daemon?):

2020-05-20T22:23:44.170+0200 I  NETWORK  [listener] connection accepted from 178.63.21.176:34998 #149 (75 connections now open)
2020-05-20T22:23:44.170+0200 I  NETWORK  [conn149] received client metadata from 178.63.21.176:34998 conn149: { driver: { name: "mongo-go-driver", version: "v1.1.4" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.13.4" }
2020-05-20T22:23:44.182+0200 I  NETWORK  [conn144] end connection 178.63.21.176:34902 (74 connections now open)

It was impossible to get a shell and the server wasn't making any progress. Even "sudo systemctl stop mongodb" did NOT stop the server; it hung. I had to send a SIGTERM to stop it (I think systemctl also sends SIGTERM, so I'm not sure why that worked; maybe it was just slow and I got lucky on the timing).
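
(By sending a SIGTERM I mean roughly the following; the mongod log path is the package default and may differ on this host:)

sudo kill -TERM "$(pidof mongod)"     # ask mongod directly for a clean shutdown
tail -f /var/log/mongodb/mongod.log   # watch whether the shutdown actually makes progress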

I think there was not enough oplog headroom on the other servers for this node to catch up, so it may be expected that it stayed stuck, but I did not expect to be unable to get a prompt and send commands. Is this expected?
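
(For what it's worth, the oplog window on a healthy member can be checked with something like the following; db2.updn.io is the sync source from the log above, and the exact invocation is only a sketch:)

# oplog size and the time range it currently covers
mongo --host db2.updn.io:27017 --eval 'rs.printReplicationInfo()'
# state and last applied optime of each member, to see how far behind the stuck node was
mongo --host db2.updn.io:27017 --eval 'rs.status().members.forEach(function (m) { print(m.name, m.stateStr, m.optimeDate); })'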

After that I decided to clear the data dir and do an initial sync, so unfortunately I don't have the diagnostic.data any more, but I have the complete log if you want it.



 Comments   
Comment by Carl Champain (Inactive) [ 27/May/20 ]

Yes, this error most likely indicates a disk failure.

Comment by Adrien Jarthon [ 27/May/20 ]

Ok thanks, so this error points more to a disk failure than to RAM? As I don't have the data dir any more (I have most of the rest, though), let's not spend time digging further; it'll probably lead nowhere. If I see this again I'll definitely try to keep the data dir.

Comment by Carl Champain (Inactive) [ 27/May/20 ]

Hi bigbourin@gmail.com,

The WT read checksum error leads us to suspect some form of physical corruption. If you encounter this problem again in the future, holding onto a copy of the database's dbPath directory would be helpful. Our ability to determine the source of the corruption would also depend on your ability to provide:

  1. The logs for the affected node, including before, leading up to, and after the first sign of corruption.
  2. A description of the underlying storage mechanism in use (example commands for gathering this are sketched below the list), including details like:
    1. What file system and/or volume management system is in use?
    2. Is data storage locally attached or network-attached?
    3. Are disks RAIDed and if so how?
    4. Are disks SSDs or HDDs?
  3. A description of your backup method, if any.
  4. A description of whether your disks have been recently checked for integrity, and how.
  5. A history of the deployment, including:
    1. a timeline of version changes
    2. a timeline of hardware upgrade/downgrade cycles or configuration changes
    3. a timeline of disaster recovery or backup restoration activities
    4. a timeline of any manipulations of the underlying database files, including copies or moves, and information about whether mongod was running during each manipulation.

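For example, on a Linux host output along these lines would cover most of the storage questions in item 2; the commands and paths are only illustrative and would need adjusting to your setup:

df -hT /var/lib/mongodb                    # file system type and capacity backing the dbPath (path assumes the package default)
lsblk -o NAME,TYPE,ROTA,SIZE,MOUNTPOINT    # block device layout; ROTA=1 usually indicates an HDD, 0 an SSD
cat /proc/mdstat                           # software RAID status, if Linux md RAID is in use
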
We recommend you keep an eye out for that host.

Is this expected?

The ideal resolution is to perform a clean resync from an unaffected node. Otherwise unexpected behavior may occur and the corruption problem won't be fixed.
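
A minimal sketch of that resync, assuming a systemd-managed node with the Debian/Ubuntu package defaults for the service name, dbPath and mongodb user (set the old data aside rather than deleting it, in case further investigation is needed):

sudo systemctl stop mongodb                               # stop the affected node
sudo mv /var/lib/mongodb /var/lib/mongodb.corrupt         # keep the old dbPath for later inspection
sudo install -d -o mongodb -g mongodb /var/lib/mongodb    # recreate an empty dbPath owned by the mongodb user
sudo systemctl start mongodb                              # on startup the node runs an initial sync from the replica set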

I will now close this ticket.

Kind regards,
Carl
 
