[SERVER-28157] Can't start secondary, data corrupted after clean shutdown Created: 01/Mar/17  Updated: 27/Jul/18  Resolved: 21/Sep/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stephen Machnowski Assignee: Kelsey Schubert
Resolution: Done Votes: 1
Labels: envm, rns, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File messages     File metrics.2017-03-01T13-45-17Z-00000     File metrics.2017-03-01T16-39-13Z-00000     File metrics.2017-03-01T16-47-40Z-00000     File metrics.2017-03-01T17-10-06Z-00000     Text File mongod.log    
Operating System: Linux
Participants:

 Description   

We have a development DB, 3 member replica set running Mongo DB server 3.4.2 on Amazon Linux (EC2 t2.medium), recently upgraded from 3.2.x.
rs.status() returned all members were healthy.
We restarted each of the 3 members to enable a configuration change for logRotate in the following order:
secondary 3 (hidden:true, priority:0, port:29019),
secondary 2 (priority:20, port:29018),
primary (priority 30, port:29017)
They were each restarted using: service mongod restart with a short pause between each one.
This database had no clients connected at the time.
After restarting all 3, in mongo client we issued rs.status() and noticed "primary" and "secondary 2" were healthy but "secondary 3" was not healthy, and could not be contacted. It can also no longer be started.
Looking at "secondary 3" the first error is following:

2017-03-01T15:55:00.534+0000 E STORAGE  [repl writer worker 8] WiredTiger error (0) [1488383700:534159][9280:0x7f8edc3d3700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: read checksum error for 4096B block at offset 45056: block header checksum of 3605474371 doesn't match expected checksum of 3806882032
2017-03-01T15:55:00.534+0000 E STORAGE  [repl writer worker 8] WiredTiger error (0) [1488383700:534218][9280:0x7f8edc3d3700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: index-14-4040745961588520825.wt: encountered an illegal file format or internal value
2017-03-01T15:55:00.534+0000 E STORAGE  [repl writer worker 8] WiredTiger error (-31804) [1488383700:534226][9280:0x7f8edc3d3700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-03-01T15:55:00.534+0000 I -        [repl writer worker 8] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2017-03-01T15:55:00.534+0000 I -        [repl writer worker 8]
***aborting after fassert() failure
2017-03-01T15:55:00.570+0000 F -        [repl writer worker 8] Got signal: 6 (Aborted).

Now immediately on restart we get logged:

2017-03-01T16:39:22.408+0000 E STORAGE  [repl writer worker 7] WiredTiger error (0) [1488386362:408202][9783:0x7fde825da700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: read checksum error for 4096B block at offset 45056: block header checksum of 3605474371 doesn't match expected checksum of 3806882032
2017-03-01T16:39:22.408+0000 E STORAGE  [repl writer worker 7] WiredTiger error (0) [1488386362:408228][9783:0x7fde825da700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: index-14-4040745961588520825.wt: encountered an illegal file format or internal value
2017-03-01T16:39:22.408+0000 E STORAGE  [repl writer worker 7] WiredTiger error (-31804) [1488386362:408245][9783:0x7fde825da700], file:index-14-4040745961588520825.wt, WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-03-01T16:39:22.408+0000 I -        [repl writer worker 7] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2017-03-01T16:39:22.408+0000 I -        [repl writer worker 7]
***aborting after fassert() failure
2017-03-01T16:39:22.408+0000 I -        [repl writer worker 15] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-03-01T16:39:22.408+0000 I -        [repl writer worker 15]
 
***aborting after fassert() failure



 Comments   
Comment by Kelsey Schubert [ 21/Sep/17 ]

Hi smachnowski@thinkmap.com and yasek,

Sorry for the delay getting back to you. From the information provided this appears to be most likely the result of disk corruption outside of MongoDB. Unfortunately, in cases like this it is very challenging to identify the root cause with more specificity without a reproduction.

Kind regards,
Kelsey

Comment by Siarhiej Jaskievi? [ 01/May/17 ]

Hi!
It seems to me we ran into this problem. We have Mongodb 3.4.3 installation with several replicas. Suddenly one of our secondaries crashed with such messages:

2017-05-01T11:47:14.970+0300 E STORAGE  [repl writer worker 7] WiredTiger error (0) [1493628434:970698][940374:0x7f24b7b02700], file:collection-89--8138649268142906461.wt, WT_CURSOR.search: read checksum error for 12288B block at offset 3202289664: calculated block checksum of 1977645955 doesn't match expected checksum of 1445311998
2017-05-01T11:47:14.970+0300 E STORAGE  [repl writer worker 7] WiredTiger error (0) [1493628434:970744][940374:0x7f24b7b02700], file:collection-89--8138649268142906461.wt, WT_CURSOR.search: collection-89--8138649268142906461.wt: encountered an illegal file format or internal value
2017-05-01T11:47:14.970+0300 E STORAGE  [repl writer worker 7] WiredTiger error (-31804) [1493628434:970758][940374:0x7f24b7b02700], file:collection-89--8138649268142906461.wt, WT_CURSOR.search: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 7] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 11] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 13] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 12] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 1] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 4] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.970+0300 I -        [repl writer worker 2] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2017-05-01T11:47:14.971+0300 I -        [repl writer worker 3] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64

Comment by Stephen Machnowski [ 01/Mar/17 ]

Attached are the full mongo log, diagnostics directory files, and var/log/messages system log.

Comment by Kelsey Schubert [ 01/Mar/17 ]

Hi smachnowski@thinkmap.com,

Thanks for reporting this behavior. So we can continue to investigate, would you please provide the following information?

  • The complete mongod log files for "secondary 3"
  • The a complete archive of the diagnostic.data directory of "secondary 3"
  • The system logs of the "secondary 3" host

Thank you again for your help,
Thomas

Generated at Thu Feb 08 04:17:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.