[SERVER-71661] Inconsistency between nodes Created: 22/Nov/22  Updated: 19/Jan/23  Resolved: 19/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Li xin Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu mongodb 4.2.14


Attachments: PNG File err.png    
Participants:

 Description   

I found that some of the data read from the primary differs from what was written. Documents with the same _id are inconsistent between the primary and the secondaries,
like this



 Comments   
Comment by Eric Sedor [ 27/Dec/22 ]

Hi bingfeng198878@163.com, I wanted to see if you'd be able to provide any of the above information. If not, I will close this as you requested.

Comment by Eric Sedor [ 28/Nov/22 ]

Hi bingfeng198878@163.com, I am treating this as a SERVER ticket rather than a WT ticket initially, as this seems to be part of your use of MongoDB. To start with, please make a complete copy of the node's $dbpath directory to safeguard it, so that you can continue to work off of the current $dbpath.

Our ability to determine the source of this issue depends greatly on your ability to provide:

  1. The logs for the affected node, including before, leading up to, and after the first sign of corruption.
  2. As much of syslog and dmesg content leading up to the first sign of corruption as possible.
  3. A description of the underlying storage mechanism in use, including details like:
    1. What file system and/or volume management system is in use?
    2. Is data storage locally attached or network-attached?
    3. Are disks RAIDed and if so how?
    4. Are disks SSDs or HDDs?
  4. A description of your backup method, if any.
  5. Whether your disks have been recently checked for integrity.
  6. A history of the deployment, including:
    1. a timeline of version changes
    2. a timeline of hardware upgrade/downgrade cycles or configuration changes
    3. a timeline of disaster recovery or backup restoration activities
    4. a timeline of any manipulations of the underlying database files, including copies or moves, and information about whether mongod was running during each manipulation.

The ideal resolution is to perform a clean resync from an unaffected node. If you can provide us with the answers above, we can investigate further.

Comment by Li xin [ 24/Nov/22 ]

I think this issue can be closed; the probable cause is a “bit flip”.

Comment by Li xin [ 22/Nov/22 ]

I queried the primary:

hn_blkmaster_offline_shard5:PRIMARY> db.stripe_delete_file.find({ "_id" : NumberLong("91236400395")})
{ "_id" : NumberLong("91236400395"), "mode" : NumberLong(203780), "sid" : NumberLong(72332451), "offset" : NumberLong("27487795040029"), "size" : NumberLong(58254) }

I queried the other 3 secondaries:

hn_blkmaster_offline_shard5:SECONDARY> db.stripe_delete_file.find({ "_id" : NumberLong("91236400395")})
{ "_id" : NumberLong("91236400395"), "mode" : NumberLong(203780), "sid" : NumberLong(72332451), "offset" : NumberLong(4345629), "size" : NumberLong(58254) }

The offset field is different.
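
The comparison above can be sketched as a small field-by-field diff of the two query results. This is only an illustration: diffFields is a hypothetical helper, not a MongoDB API, and the NumberLong values are written as BigInt literals, which assumes Node.js or mongosh (the legacy 4.2 shell may not support BigInt).

```javascript
// Hypothetical helper (not part of MongoDB): list the fields whose values
// differ between two copies of the same document.
function diffFields(primaryCopy, secondaryCopy) {
  const keys = new Set([...Object.keys(primaryCopy), ...Object.keys(secondaryCopy)]);
  const diffs = [];
  for (const key of keys) {
    // BigInt values compare by value under strict equality.
    if (primaryCopy[key] !== secondaryCopy[key]) {
      diffs.push({ field: key, primary: primaryCopy[key], secondary: secondaryCopy[key] });
    }
  }
  return diffs;
}

// The two documents returned by the queries above, with NumberLong
// values transcribed as BigInt literals (an assumption of this sketch).
const primaryDoc = { _id: 91236400395n, mode: 203780n, sid: 72332451n, offset: 27487795040029n, size: 58254n };
const secondaryDoc = { _id: 91236400395n, mode: 203780n, sid: 72332451n, offset: 4345629n, size: 58254n };

// Reports a single differing field: offset.
console.log(diffFields(primaryDoc, secondaryDoc));
```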

 

I write with majority write concern, and there is no replication lag between the primary and the secondaries.

In hexadecimal, the offset on the primary is 0x190000424f1d and the offset on the secondaries is 0x424f1d (that is, NumberLong(4345629)). So I think the highest byte of the value on the primary has changed from 0x00 to 0x19 (in other cases, from 0x00 to 0x12).
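
The hexadecimal comparison above can be checked with a short sketch. It uses BigInt because JavaScript's bitwise operators on plain numbers truncate to 32 bits; this assumes Node.js or mongosh rather than the legacy 4.2 shell. The names countSetBits and xorDiff are illustrative only.

```javascript
// Offsets as reported by the primary and by the secondaries.
const primaryOffset = 27487795040029n;
const secondaryOffset = 4345629n;

// XOR leaves a 1 in every bit position where the two values disagree.
const xorDiff = primaryOffset ^ secondaryOffset;

// Count the set bits of a non-negative BigInt.
function countSetBits(value) {
  let count = 0;
  while (value > 0n) {
    count += Number(value & 1n);
    value >>= 1n;
  }
  return count;
}

console.log(primaryOffset.toString(16));   // 190000424f1d
console.log(secondaryOffset.toString(16)); // 424f1d
console.log(xorDiff.toString(16));         // 190000000000
console.log(countSetBits(xorDiff));        // 3
```

The low 40 bits are identical on both sides. The difference is confined to a single byte (bits 40 to 47), where 0x19 has three bits set.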

At least 32 documents contain wrong data.

 

Generated at Thu Feb 08 06:19:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.