[SERVER-17510] "Didn't find RecordId in WiredTigerRecordStore" on collections after an idle period Created: 09/Mar/15  Updated: 23/May/18  Resolved: 09/Mar/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.0
Fix Version/s: 3.0.1, 3.1.0

Type: Bug Priority: Critical - P2
Reporter: Michael Cahill (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: ET
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File Log_To_Upload, HTML File serverStatus() command
Issue Links:
Related
is related to SERVER-17506 Race between inserts and checkpoints ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

No self-contained repro has been found. This bug has only been observed on complex systems. In particular, workloads that continually update collections will not hit this problem: only workloads where collections are idle for a period and are then updated can trigger this bug.

Participants:

 Description   

A bug in an internal WiredTiger thread could cause corruption in collections that become idle, then are later updated again.

A WiredTiger thread that discards old collections in cache could (under rare circumstances) start discarding a collection but give up part way through, leaving an incomplete tree in cache. If that collection was subsequently updated, the on-disk tree could become corrupted. A repairDatabase operation may be required to salvage data.
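
When deciding which collections may need to be salvaged, it can help to list the namespaces that mongod has reported this assertion for. The following is a minimal sketch, not an official tool: it assumes the log format matches the sample line quoted in the comments on this ticket (assertion 28556 followed by `ns:<namespace>`), and the names `ASSERTION_RE` and `affected_namespaces` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import Iterable, List

# Pattern taken from the single log line quoted in this ticket's comments.
# Assumption: other mongod versions may format this message differently.
ASSERTION_RE = re.compile(
    r"assertion 28556 Didn't find RecordId in WiredTigerRecordStore ns:(\S+)"
)


def affected_namespaces(log_lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated namespaces named in matching log lines."""
    found = set()
    for line in log_lines:
        match = ASSERTION_RE.search(line)
        if match:
            # group(1) is the text after "ns:" up to the next whitespace.
            found.add(match.group(1))
    return sorted(found)
```

To use it against a log file, pass the open file object, which iterates line by line, for example `affected_namespaces(open("mongod.log"))`. The result only shows where the assertion was logged; it does not prove that other collections are undamaged.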



 Comments   
Comment by Ramon Fernandez Marina [ 23/May/18 ]

onkarb, MongoDB 3.0 is EOL – please upgrade to a supported version (MongoDB 3.6 at the moment) and if the problem persists open a new ticket.

Thanks,
Ramón.

Comment by onkar [ 23/May/18 ]

Hi,

  We are using MongoDB 3.0.15 in standalone mode and are facing a similar issue. The following log entry appears in mongod.log when we try to back up the database using the mongodump command.

2018-05-22T01:43:50.759-0700 I QUERY [conn770575] assertion 28556 Didn't find RecordId in WiredTigerRecordStore ns:analytics_data_C88351b08d33664e013d8ad07ea32ff51.client_visit_data query:{ $query: {}, $snapshot: true }

We recently migrated from MongoDB 3.0.6 to 3.0.15. To fix this issue we even ran the repairDatabase() command on all databases, but the issue appears to recur. It has also been observed on more than one collection.

I have attached the complete backtrace observed in mongod.log, and the output of the db.serverStatus() command, as attachment files. I would like to understand:

  1. In which scenarios does WiredTiger encounter this issue?
  2. Is there any workaround to recover the database so that we can back it up and avoid this issue? Does this issue also depend on the number of databases/collections present on a single machine?

Thanks in advance.

Regards,

Onkar

Comment by Ramon Fernandez Marina [ 19/May/15 ]

sega, if you upgraded this replica set from 3.0.0 to 3.0.3 (possibly via 3.0.1 and/or 3.0.2) this is most likely SERVER-17506.

If you either installed this replica set from scratch, or upgraded it but never ran 3.0.0, please open a separate ticket and upload all the relevant information (including logs for all nodes).

Thanks,
Ramón

Comment by Sergey I. Yarkin [ 19/May/15 ]

ramon.fernandez, I have a replica set with 3 nodes, all running the same version of mongod (3.0.3).

Comment by Ramon Fernandez Marina [ 19/May/15 ]

sega, were you running 3.0.0 at one point? If you were, it's possible that you were affected by SERVER-17506 but the effects did not manifest until now.

Comment by Sergey I. Yarkin [ 19/May/15 ]

Hi, I still have this issue in 3.0.3 (repairDatabase temporarily resolves it).

Comment by Ramon Fernandez Marina [ 18/Mar/15 ]

Hi sallgeud, this issue was fixed in the 3.0.1 stable release and the 3.1.0 development release, both published yesterday.

Regards,
Ramón.

Comment by Chad Kreimendahl [ 18/Mar/15 ]

So this wasn't fixed?

Generated at Thu Feb 08 03:44:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.