Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Hello!
I'm running MongoDB 4.2.6 in a PSS replica set.
On May 20 at 14:33 (UTC+2), my primary server crashed suddenly (no particular job or load at the time) with this error:
2020-05-20T14:33:46.772+0200 E STORAGE [conn142228] WiredTiger error (0) [1589978026:771891][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 274: collection-48--3355261872768217798.wt: read checksum error for 114688B block at offset 3694592000: calculated block checksum doesn't match expected checksum Raw: [1589978026:771891][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 274: collection-48--3355261872768217798.wt: read checksum error for 114688B block at offset 3694592000: calculated block checksum doesn't match expected checksum
... (and then a dump of some memory) ...
2020-05-20T14:33:46.786+0200 E STORAGE [conn142228] WiredTiger error (-31802) [1589978026:786272][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 292: collection-48--3355261872768217798.wt: fatal read error: WT_ERROR: non-specific WiredTiger error Raw: [1589978026:786272][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_block_read_off, 292: collection-48--3355261872768217798.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
2020-05-20T14:33:46.786+0200 E STORAGE [conn142228] WiredTiger error (-31804) [1589978026:786278][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1589978026:786278][9164:0x7f653a038700], file:collection-48--3355261872768217798.wt, WT_CURSOR.next: __wt_panic, 490: the process must exit and restart: WT_PANIC: WiredTiger library panic
2020-05-20T14:33:46.786+0200 F - [conn142228] Fatal Assertion 50853 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 414
2020-05-20T14:33:46.786+0200 F - [conn142228]
2020-05-20T14:33:46.822+0200 F - [conn142228] Got signal: 6 (Aborted).
Fortunately the failover went well: one of the secondaries became primary. I had a look at this error, and it looks like a corrupted data read (I'm not sure whether the corruption is on disk or in memory?). The server is fairly new, and I ran some tests on the disk and memory but didn't find any problem. Can you provide a bit more detail about what could cause this error? Is there any way I could confirm whether it's a hardware or software issue? Is there anything to investigate in the MongoDB code itself?
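For context, this is roughly the check I planned to run against the affected collection once the node was healthy: a rough PyMongo sketch with placeholder database and collection names (not taken from my deployment), using the validate command, which re-reads the collection and its indexes and should hit the same read errors if the on-disk blocks are actually bad.

# Rough sketch; "mydb" and "mycollection" are placeholders, not my real names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

# validate re-reads the collection and its indexes; full=True asks for a
# deeper check of the underlying storage files.
result = client["mydb"].command("validate", "mycollection", full=True)
print("valid:", result.get("valid"))
print("errors:", result.get("errors", []))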
After that (May 20 at 22:21 UTC+2) I tried restarting mongod (without clearing the data directory) to see whether it could recover by itself, and ended up in a weird state: mongod seemed to start normally according to the logs, and I didn't see anything suspicious related to corrupted data (maybe you'll find something). The log says the server went into the rollback state and then secondary, but then talks about rollback again; here is an extract:
2020-05-20T22:21:33.748+0200 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2020-05-20T22:21:33.748+0200 I REPL [rsBackgroundSync] Rollback using 'recoverToStableTimestamp' method.
2020-05-20T22:21:33.748+0200 I REPL [rsBackgroundSync] Scheduling rollback (sync source: db2.updn.io:27017)
2020-05-20T22:21:33.748+0200 I ROLLBACK [rsBackgroundSync] transition to ROLLBACK
2020-05-20T22:21:33.748+0200 I REPL [rsBackgroundSync] State transition ops metrics: { lastStateTransition: "rollback", userOpsKilled: 0, userOpsRunning: 30 }
2020-05-20T22:21:33.748+0200 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
...
2020-05-20T22:21:33.748+0200 I ROLLBACK [rsBackgroundSync] Waiting for all background operations to complete before starting rollback
...
2020-05-20T22:21:33.749+0200 I ROLLBACK [rsBackgroundSync] Finished waiting for background operations to complete before rollback
2020-05-20T22:21:33.749+0200 I ROLLBACK [rsBackgroundSync] finding common point
But at this point it was not catching up or doing anything visible (no CPU or disk activity), and I could NOT connect to it: when starting the mongo shell, it would initiate the connection but hang before displaying the prompt (no error). In the logs I only see a bunch of entries like these (probably from your monitoring daemon?):
2020-05-20T22:23:44.170+0200 I NETWORK [listener] connection accepted from 178.63.21.176:34998 #149 (75 connections now open)
2020-05-20T22:23:44.170+0200 I NETWORK [conn149] received client metadata from 178.63.21.176:34998 conn149: { driver: { name: "mongo-go-driver", version: "v1.1.4" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.13.4" }
2020-05-20T22:23:44.182+0200 I NETWORK [conn144] end connection 178.63.21.176:34902 (74 connections now open)
It was impossible to get a shell, and the server made no progress. Even "sudo systemctl stop mongodb" did NOT stop the server; it hung. I had to send a SIGTERM to stop it (I think systemctl also sends SIGTERM, so I'm not sure why that worked; maybe it was just slow and I got lucky on the timing).
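In case it's useful, here is the kind of quick probe I could have used instead of the interactive shell: a rough PyMongo sketch (the hostname is a placeholder, not my real one) with short timeouts, so the check errors out instead of hanging indefinitely the way the shell did.

# Rough sketch with a placeholder host; short timeouts so this fails fast.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient(
    "mongodb://db1.example.com:27017/",
    directConnection=True,          # talk to this node only, not the whole replica set
    serverSelectionTimeoutMS=5000,  # give up if the handshake never completes
    socketTimeoutMS=5000,           # give up if a command stops responding
)

try:
    print(client.admin.command("ping"))
except PyMongoError as exc:
    print("node did not respond:", exc)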
I think there was not enough oplog headroom on the other servers for this node to catch up by syncing, so it may be expected that it stayed stuck, but I did not expect to be unable to get a prompt and send commands. Is this expected?
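For reference, this is roughly how I would check whether the other members still had enough oplog for this node to catch up: a rough PyMongo sketch against the sync source shown in the log above (db2.updn.io), computing the oplog window from the first and last entries of local.oplog.rs.

# Rough sketch: compute the oplog window (how far back this member can still
# serve changes) on the sync source mentioned in the rollback log above.
from pymongo import MongoClient

client = MongoClient("mongodb://db2.updn.io:27017/", directConnection=True)
oplog = client["local"]["oplog.rs"]

first = oplog.find_one(sort=[("$natural", 1)])   # oldest oplog entry
last = oplog.find_one(sort=[("$natural", -1)])   # newest oplog entry

window_secs = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_secs / 3600.0))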
After that I decided to clear the data directory and do an initial sync, so unfortunately I don't have the diagnostic.data any more, but I have the complete log if you want it.