[SERVER-57379] WiredTiger error when building index. Created: 03/Jun/21  Updated: 17/Jun/21  Resolved: 15/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: jongsun moon Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: build, index
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File after_repair_operation_first_start.log     Text File mongod-log.log     Text File mongod_build_index_error_2021_05_26.log     Text File mongod_config_server_not_work_2021_05_25.log     Text File repair_20210611.log     Text File shard_2021_06_08.log     Text File shard_2021_06_08_1.log     Text File shard_2021_06_08_2-1.log     Text File shard_2021_06_08_2.log     File shard_cn_1.log.2021-05-25T05-38-59     File shard_cn_1.log.2021-05-26T05-23-19     File shard_cn_1.log.2021-06-11T01-21-58    
Operating System: ALL
Participants:

 Description   

We have a collection "A".

Collection "A" stats.

  • count : 3852675627
  • size : 3.31TB
  • storageSize : 481.60 GB

When we were building a compound index,
we got "WiredTiger error ... read checksum error for 16384B block at offset 39527702528: calculated block checksum doesn't match expected checksum Raw:".

 

We experienced a similar error before; at that time we moved to a different physical server to avoid it. (We already reported that situation here.)

 

So when we hit this error again, we changed the physical server.
But this time that did not resolve it.

 

What we did:
1. We prepared a new physical server.
=> We got the "WiredTiger error" again.
2. We changed the SSD.
=> We got the "WiredTiger error" again.
3. Suspecting a disk problem where collection "A" was written, we renamed collection "A" to "A_old" and created collection "A" again from scratch.
=> We got the "WiredTiger error" again.
4. We tried building the index with various options (sparse, an explicit name, splitting the index key, ...); a sketch of such invocations follows below.
=> We got the "WiredTiger error" again.
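For reference, the createIndex variants mentioned in step 4 would look roughly like the invocation below. This is a hypothetical illustration only: the database name, field names, and index name are placeholders, not the actual ones used.

# Hypothetical sketch of one index-build variant (sparse option + explicit name).
# "someDb", collection "A", fields "f1"/"f2", and the index name are placeholders.
mongo --eval '
  db.getSiblingDB("someDb").A.createIndex(
    { f1: 1, f2: 1 },                       // compound key
    { sparse: true, name: "f1_f2_sparse" }  // options tried: sparse, explicit name
  );
'
# "Splitting the index key" presumably refers to building several smaller
# indexes instead of one large compound index.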

Environment:

  • Server
  • OS : Ubuntu 18.04.5 LTS
  • Memory : 64 GB
  • Disk : SSD 7.3T (4T + 4T with LVM)
  • MongoDB :
    db version v4.2.8
    git version: 43d25964249164d76d5e04dd6cf38f6111e21f5f
    OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
    allocator: tcmalloc
    modules: none
    build environment:
    distmod: ubuntu1804
    distarch: x86_64
    target_arch: x86_64

 

Do you have any advice?

 



 Comments   
Comment by jongsun moon [ 17/Jun/21 ]

Hi Edwin Zhou,

 

Thank you very much for your help.

 

Regards,

 

Comment by Edwin Zhou [ 15/Jun/21 ]

Hi jjongei0@gmail.com,

Unfortunately we aren't able to repair this level of corruption.

To avoid a problem like this in the future, it is our strong recommendation to:

Best,
Edwin

Comment by jongsun moon [ 15/Jun/21 ]

Hi Edwin Zhou,

We have only one mongod server, which is set up as replica set 0 (one mongos + one config server + one mongod server).

So we cannot do a clean resync.

Is there anything else we can do?

 

Comment by Edwin Zhou [ 11/Jun/21 ]

Hi jjongei0@gmail.com,

Given that your topology appears to be a replica set, we may be able to resolve this corruption by doing a clean resync from an unaffected node.
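For reference, such a clean resync generally amounts to the steps below (a sketch only; the dbPath, the systemd service name, and the file ownership are assumptions, and at least one healthy member is required to sync from):

# Sketch of a clean resync for one replica-set member (assumed dbPath /data/db).
sudo systemctl stop mongod
# keep the corrupted files until the resync has completed successfully
sudo mv /data/db /data/db.corrupt.$(date +%F)
sudo mkdir -p /data/db
sudo chown mongodb:mongodb /data/db
# on restart, the now-empty member performs an initial sync from a healthy member
sudo systemctl start mongod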

We look forward to hearing if you're able to find success with this method.

Best,
Edwin

Comment by jongsun moon [ 11/Jun/21 ]

Hi Edwin Zhou,

Unfortunately, the --repair operation failed.

Regards,

Comment by Edwin Zhou [ 09/Jun/21 ]

Hi jjongei0@gmail.com,

Thank you for providing the output of the db.collection.validate command. It indeed confirms that there is collection-level corruption. We look forward to hearing back from you with the results of running --repair.

In the event that a --repair operation is unsuccessful, then please also provide:

  • The logs leading up to the first occurrence of any issue
  • The logs of the repair operation.
  • The logs of any attempt to start mongod after the repair operation completed.
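For reference, a --repair run is started with mongod shut down and pointed at the existing data files, roughly as below (a sketch only; the dbPath and log path are assumptions, and the dbPath should be backed up first):

# Sketch of a standalone --repair invocation (paths are assumptions)
mongod --repair --dbpath /data/db --logpath /var/log/mongodb/repair.log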

Best,
Edwin

Comment by jongsun moon [ 08/Jun/21 ]

 

Validation results for the other collections:

shard_2021_06_08_1.log

shard_2021_06_08_2.log

shard_2021_06_08_1.log

 

In the same database, I created the same collection again (the previous collection had been renamed) and built the index, and I received the output below.

2021-06-08T09:01:04.852+0000 I  -        [conn205]   Index Build: scanning collection: 65707300/1069318918 6%
2021-06-08T09:01:07.002+0000 I  -        [conn205]   Index Build: scanning collection: 66091200/1069318918 6%
 
2021-06-08T09:01:08.099+0000 E  -        [conn205] Assertion: Location10320: BSONElement: bad type 67 src/mongo/bson/bsonelement.cpp 696
 
2021-06-08T09:01:08.147+0000 I  STORAGE  [conn205] Deferring table drop for index 'KWID_RANK_OF_RISINGKW_RANK_OF_NEWKW_RANK_OF_SPECIFICKW_RANK_OF_ASNKW_RANK_OF_ASNMNAKW_RANK_OF_TRIALKW_RANK_OF_SUITKW_COPAT_CNT_TOT_KW_CATEGORY_LOCALE' on collection 'IPOFFICE_210201_CN.TBL_FEA_KW_KW.$KWID_RANK_OF_RISINGKW_RANK_OF_NEWKW_RANK_OF_SPECIFICKW_RANK_OF_ASNKW_RANK_OF_ASNMNAKW_RANK_OF_TRIALKW_RANK_OF_SUITKW_COPAT_CNT_TOT_KW_CATEGORY_LOCALE (5c2dbce3-4700-4507-8fe3-7b39d05ae7ee)'. Ident: 'IPOFFICE_210201_CN/index-2--1086201584841391775', commit timestamp: 'Timestamp(1623142859, 1)'

 And then I ran collection validation and found corrupted records.

2021-06-08T10:22:41.741+0000 I STORAGE [conn225] WiredTiger progress WT_SESSION.verify 8775600
2021-06-08T10:22:41.746+0000 I STORAGE [conn225] WiredTiger progress WT_SESSION.verify 8775700

2021-06-08T10:24:14.892+0000 I STORAGE [conn225] document at location: RecordId(66267652) is corrupted

2021-06-08T10:28:34.848+0000 I ACCESS [conn238] Successfully authenticated as principal __system on local from client 192.168.0.165:47190
2021-06-08T10:28:51.515+0000 I STORAGE [conn225] document at location: RecordId(241390443) is corrupted
2021-06-08T10:28:57.786+0000 I STORAGE [conn225] document at location: RecordId(245063106) is corrupted
2021-06-08T10:32:30.522+0000 I STORAGE [conn225] document at location: RecordId(361048613) is corrupted

 

Now I am running "mongod --repair".

 

 

Comment by jongsun moon [ 08/Jun/21 ]

Hi Edwin Zhou.

Thank you for your reply.

I ran db.collection.validate.

Then I got the messages below. (I attached the log file at the bottom.)

The size of the file that hit the error in the log ("collection-20-7619064565122544008.wt") is 482 GB.

Is there anything we can do about it?

Regards,

 

 

2021-06-08T01:26:17.551+0000 I STORAGE [conn18] WiredTiger progress WT_SESSION.verify 2562300
2021-06-08T01:26:17.563+0000 I STORAGE [conn18] WiredTiger progress WT_SESSION.verify 2562400
2021-06-08T01:26:17.570+0000 E STORAGE [conn18] WiredTiger error (0) [1623115577:570311][15886:0x7f2d2b6a3700], file:IPOFFICE_210201_CN/collection-20-7619064565122544008.wt, WT_SESSION.verify: __wt_block_read_off, 274: IPOFFICE_210201_CN/collection-20-7619064565122544008.wt: read checksum error for 16384B block at offset 39527702528: calculated block checksum doesn't match expected checksum
2021-06-08T01:26:17.570+0000 E STORAGE [conn18] WiredTiger error (0) [1623115577:570399][15886:0x7f2d2b6a3700], file:IPOFFICE_210201_CN/collection-20-7619064565122544008.wt, WT_SESSION.verify: __wt_bm_corrupt_dump, 135: {39527702528, 16384, 0x578c57a3}: (chunk 1 of 16): 0 ....

...

2021-06-08T01:26:18.422+0000 E STORAGE [conn18] WiredTiger error (0) [1623115578:422514][15886:0x7f2d2b6a3700], file:IPOFFICE_210201_CN/collection-20-7619064565122544008.wt, WT_SESSION.verify: __verify_ckptfrag_chk, 526: checkpoint ranges never verified: 375
2021-06-08T01:26:18.762+0000 E STORAGE [conn18] WiredTiger error (0) [1623115578:762395][15886:0x7f2d2b6a3700], file:IPOFFICE_210201_CN/collection-20-7619064565122544008.wt, WT_SESSION.verify: __verify_filefrag_chk, 438: file ranges never verified: 375
2021-06-08T01:26:18.762+0000 E STORAGE [conn18] verify() returned WT_ERROR: non-specific WiredTiger error. This indicates structural damage. Not examining individual documents.
2021-06-08T01:26:18.769+0000 I COMMAND [ShardServerCatalogCacheLoader-2] command config.cache.collections command: find { find: "cache.collections", filter: { _id: ... }, ntoreturn: 1, singleBatch: true, $db: "config" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:555 locks:{ ReplicationStateTransition: { acquireCount: { w: ... } }, Global: { acquireCount: { r: ... } }, Database: { acquireCount: { r: ... } }, Collection: { acquireCount: { r: ... } }, Mutex: { acquireCount: { r: ... } } } storage:{ data: { bytesRead: ... }, timeWaitingMicros: { schemaLock: 164161664 } } protocol:op_msg 164164ms
2021-06-08T01:26:18.771+0000 I WRITE [ShardServerCatalogCacheLoader-1] update config.cache.databases command: { q: { _id: ... }, u: { $set: { _id: "config", primary: "config", partitioned: true, version: { uuid: ... } } }, multi: false, upsert: true } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:1 locks:{ ParallelBatchWriterMode: { acquireCount: { r: ... } }, ReplicationStateTransition: { acquireCount: { w: ... } }, Global: { acquireCount: { w: ... } }, Database: { acquireCount: { w: ... } }, Collection: { acquireCount: { w: ... } }, Mutex: { acquireCount: { r: ... } } } flowControl:{ acquireCount: 2, timeAcquiringMicros: 1 } storage:{ data: { bytesRead: ... }, timeWaitingMicros: { schemaLock: 164163963 } } 164166ms

...

 

shard_2021_06_08.log

 

Regards,

Comment by Edwin Zhou [ 07/Jun/21 ]

Hi jjongei0@gmail.com,

It appears there may be some corruption at the collection level. I advise you to first make a complete copy of the database's $dbpath directory as a safeguard, so that you can continue to work off the current $dbpath.

Then, can you run db.collection.validate on the affected collection?
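A minimal sketch of that validation, assuming the namespace seen in the logs elsewhere in this ticket (IPOFFICE_210201_CN.TBL_FEA_KW_KW) and a mongod listening on the default port:

# Sketch only: full validation of the affected collection via the mongo shell.
# First take the filesystem-level copy of the dbPath (with mongod stopped).
mongo --quiet --eval 'printjson(db.getSiblingDB("IPOFFICE_210201_CN").TBL_FEA_KW_KW.validate(true))'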

Best,
Edwin
