[SERVER-57179] Mongodb Database corrupted on Windows 10 Created: 25/May/21  Updated: 28/Jul/21  Resolved: 28/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nandkishor Chavan Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb.log.2021-05-03T15-11-20     File mongodb.log.2021-05-03T15-57-03     File mongodb.log.2021-05-03T19-29-50     File mongodb.log.2021-05-11T06-57-46     File mongodb.log.2021-05-11T07-00-55     File mongodb.log.2021-05-11T07-01-48    
Issue Links:
Duplicate
is duplicated by SERVER-57183 MongoDB data corrupted after service ... Closed
Operating System: ALL
Participants:

 Description   

Problem Description

We have found that after an unexpected MongoDB service shutdown, the database sometimes ends up corrupted. This happens randomly.

MongoDB version: 4.0.4 (2008R2Plus build)

OS: Windows 10

After logging data for a couple of days we run into data corruption issues, which result in data loss for us.

Error in log file:

2021-05-11T12:27:48.920+0530 I CONTROL [main] Trying to start Windows service 'MongoDB'
2021-05-11T12:27:49.577+0530 I CONTROL [initandlisten] MongoDB starting : pid=4064 port=27017 dbpath=D:\ItantaDB 64-bit host=ESOS-2
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] db version v4.0.4
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] git version: f288a3bdf201007f3693c58e140056adf8b04839
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] allocator: tcmalloc
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] modules: none
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] build environment:
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten] distmod: 2008plus-ssl
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten] distarch: x86_64
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten] target_arch: x86_64
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten] options: { net: { bindIpAll: true }, security: { authorization: "enabled" }, service: true, storage: { dbPath: "D:\ItantaDB", wiredTiger: { engineConfig: { cacheSizeGB: 1.0 } } }, systemLog: { destination: "file", path: "C:\ProgramData\ItantaAnalytics\mongodb.log" } }
2021-05-11T12:27:51.765+0530 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1024M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2021-05-11T12:27:54.334+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:334351][4064:140735794009424], txn-recover: Main recovery loop: starting at 39/36112640 to 40/256
2021-05-11T12:27:54.436+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:435293][4064:140735794009424], txn-recover: Recovering log 39 through 40
2021-05-11T12:27:54.644+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:644163][4064:140735794009424], txn-recover: Recovering log 40 through 40
2021-05-11T12:27:54.706+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:706126][4064:140735794009424], txn-recover: Set global recovery timestamp: 0
2021-05-11T12:27:55.782+0530 I RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(0, 0)
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] WiredTiger error (2) [1620716276:740127][4064:140735794009424], file:collection-107--7568313936654431168.wt, WT_SESSION.open_cursor: __win_open_file, 565: D:\ItantaDB\collection-107--7568313936654431168.wt: handle-open: CreateFileW: The system cannot find the file specified.
: No such file or directory Raw: [1620716276:740127][4064:140735794009424], file:collection-107--7568313936654431168.wt, WT_SESSION.open_cursor: __win_open_file, 565: D:\ItantaDB\collection-107--7568313936654431168.wt: handle-open: CreateFileW: The system cannot find the file specified.
: No such file or directory
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] Failed to open a WiredTiger cursor: table:collection-107--7568313936654431168
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] This may be due to data corruption. Please read the documentation for starting MongoDB with --repair here: http://dochub.mongodb.org/core/repair
2021-05-11T12:27:56.769+0530 F - [initandlisten] Fatal Assertion 50882 at src\mongo\db\storage\wiredtiger\wiredtiger_session_cache.cpp 137
2021-05-11T12:27:56.769+0530 F - [initandlisten]

***aborting after fassert() failure
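
For reference, the options line above corresponds roughly to a mongod configuration file like the following (reconstructed from the log as an approximation, not our actual config file; the "service" entry in the log comes from running mongod as a Windows service):

net:
  bindIpAll: true
security:
  authorization: enabled
storage:
  dbPath: D:\ItantaDB
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1.0
systemLog:
  destination: file
  path: C:\ProgramData\ItantaAnalytics\mongodb.log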

 

 

Additional Notes

We tried to repair the data using the --repair command. After the repair, we are able to access MongoDB again.
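
For reference, the repair was run roughly as follows from an administrator command prompt (service name and dbpath as in our setup; they may differ on other installations):

net stop MongoDB
mongod --dbpath "D:\ItantaDB" --repair
net start MongoDB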

We have uploaded diagnostic data to Google Drive. Location: https://drive.google.com/file/d/1aMnqwqXSAUhiGbvALLboIXJFr3-jXt_-/view?usp=sharing

Additional MongoDB service logs are attached in the attachments section.

We would like to know why this issue might occur, and whether there is anything we are missing in the configuration.

Regards,

Nandkishor



 Comments   
Comment by Eric Sedor [ 28/Jul/21 ]

Hi ndchavan4289@gmail.com, I'm going to close this ticket out for now, but please reach out again if the issue recurs and chkdsk reveals nothing. Thank you!

Comment by Nandkishor Chavan [ 06/Jul/21 ]

Hi Eric, so far we have not faced a similar issue again on any site. I will get back to you if I face the issue again.

Also, I am not sure how effective Procmon will be in this case. We usually don't have access to the sites at all times, so we won't be able to run Procmon at the moment the issue occurs, and once the event has already happened, Procmon can't tell us which process may have caused it.

Also, if the issue is occurring because of physical disk corruption, then why does the database work again after repairing? We will still check with chkdsk if the issue occurs again.

Comment by Eric Sedor [ 01/Jul/21 ]

Hi ndchavan4289@gmail.com, have you had any success ruling out file access issues via Procmon, or identifying disk issues via chkdsk?

Comment by Eric Sedor [ 10/Jun/21 ]

Hi ndchavan4289@gmail.com,

It's not precisely clear which deployments are experiencing which symptoms, but it's possible that an attempt to --repair would result in the loss of user metadata if corruption affected a system.users collection.

For the latest logs and dbpath you uploaded, the errors continue to suggest an issue with physical disk corruption, or with other processes interfering with files in the MongoDB dbPath while mongod is running:

2021-06-09T10:07:27.319+0530 E STORAGE  [WTCheckpointThread] WiredTiger error (-31802) [1623213447:318829][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __win_fs_rename, 105: C:\ProgramData\ItantaDB\WiredTiger.turtle.set to C:\ProgramData\ItantaDB\WiredTiger.turtle: file-rename: MoveFileExW: The disk structure is corrupted and unreadable.
: WT_ERROR: non-specific WiredTiger error Raw: [1623213447:318829][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __win_fs_rename, 105: C:\ProgramData\ItantaDB\WiredTiger.turtle.set to C:\ProgramData\ItantaDB\WiredTiger.turtle: file-rename: MoveFileExW: The disk structure is corrupted and unreadable.
: WT_ERROR: non-specific WiredTiger error
2021-06-09T10:07:27.319+0530 E STORAGE  [WTCheckpointThread] WiredTiger error (-31802) [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_turtle_update, 397: WiredTiger.turtle: fatal turtle file update error: WT_ERROR: non-specific WiredTiger error Raw: [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_turtle_update, 397: WiredTiger.turtle: fatal turtle file update error: WT_ERROR: non-specific WiredTiger error
2021-06-09T10:07:27.320+0530 E STORAGE  [WTCheckpointThread] WiredTiger error (-31804) [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
2021-06-09T10:07:27.347+0530 F -        [WTCheckpointThread] Fatal Assertion 50853 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 409
2021-06-09T10:07:27.347+0530 F -        [WTCheckpointThread]

In at least one case in the past, one of our users had success using Procmon to identify a process that was causing similar (but not the same) errors.

Do you happen to have run chkdsk after the 2021-06-09T10:07:27.319+0530 incident on ItantaDB?
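
For illustration only, a consistency check of the volume holding the dbPath (C: for C:\ProgramData\ItantaDB in the logs above) could look like this from an elevated command prompt; /f fixes errors and, on the system drive, will prompt to schedule the check for the next reboot:

chkdsk C: /f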

Comment by Nandkishor Chavan [ 09/Jun/21 ]

Hi Eric,

We have not encountered the same issue again on the same site. However, we have encountered a similar kind of issue on another site, so I am attaching the requested files from that new site.

We have uploaded archives of the db folder and the log files to the given location for troubleshooting purposes.

Additionally, we have faced one more issue in a recent scenario. We run the MongoDB service with authentication enabled; however, after we repaired the data, we found that the user we use for authentication was no longer present in our db, so we had to create it again manually.
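
(For reference, the user was recreated manually from the mongo shell with something along these lines; the user name, password, and roles below are placeholders rather than our actual values:)

use admin
db.createUser({
  user: "appUser",
  pwd: "********",
  roles: [ { role: "readWrite", db: "appDatabase" } ]
})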

Please let me know if you need any more information from our side.

-Nandkishor

Comment by Eric Sedor [ 04/Jun/21 ]

Hi ndchavan4289@gmail.com,

Unfortunately we aren't able to readily identify a specific process that could be responsible. Would you be willing to provide the data files for this cluster for examination? This will help us obtain more information about what state MongoDB expects the data files to be in at the time of error. The process would be:

1) After an incident occurs, ensure the server is stopped and archive a copy of the dbpath (including the log files)
2) Upload that archive to this location: https://mongodb.app.box.com/f/7c2667be0f4248188fab332f2a27659f.
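
For step 1, a minimal sketch from an elevated PowerShell prompt might look like this (service name, dbpath, and log path are examples and may differ on your deployment):

net stop MongoDB
Compress-Archive -Path "D:\ItantaDB", "C:\ProgramData\ItantaAnalytics\mongodb.log*" -DestinationPath "D:\ItantaDB-dbpath.zip"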

Files shared to this location are private and will be accessible only by MongoDB employees involved with the investigation.

Eric

Comment by Nandkishor Chavan [ 31/May/21 ]

Hi Eric, thanks for your response. As per your comment, a virus scanner might be accessing the files in the dbpath. However, that's not the case here: the antivirus might be scanning the files in the dbpath as usual, but it should not affect or corrupt our db, as it is not writing to or editing those files.

Can you share the name of the process that is accessing the files in the dbpath (with a screenshot), so that we can check for that process?

Comment by Eric Sedor [ 27/May/21 ]

Hi ndchavan4289@gmail.com, I am responding here and closing SERVER-57183 as a duplicate.

The logs you've provided suggest that another process on the machine, such as a virus scanner, is accessing files in the dbpath. Can you confirm with certainty that another process is not accessing these files?

You can also upgrade to MongoDB 4.0.7 to obtain the benefit of an improvement in WT-4528, but that improvement is not a guarantee against issues with virus scanners or other similar processes on Windows file systems.

Sincerely,
Eric

EDIT: Sorry for introducing confusion. WT-4528 only introduces retries when removing files, not when accessing them, which is the case here.
