[SERVER-60055] Mongo DB is not getting restarted in production Created: 17/Sep/21 Updated: 27/Oct/23 Resolved: 26/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Gaurav Kumar | Assignee: | Backlog - Triage Team |
| Resolution: | Community Answered | Votes: | 0 |
| Labels: | bkp, buildbot, host-management | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
OS: Red Hat Linux 8 |
||
| Assigned Teams: |
Server Triage
|
||||
| Backport Requested: |
v4.4
|
||||
| Participants: | |||||
| Description |
|
We have 4.4 TB of data (as occupied on the file system), and the indexes we created over the collections occupy 750 GB of that data space.
|
| Comments |
| Comment by Dmitry Agranat [ 26/Sep/21 ] | ||||||||||
|
Hi gaurav.kumar@aitechbay.com, based on the earlier provided logs, it does not seem like all the issues in your environment were addressed. Unfortunately, MongoDB is not able to repair the corrupted data. To avoid a problem like this in the future, it is our strong recommendation to:
Please reach out to our Consulting team (in the same way you did in the past) to make sure your environment is configured with our Production notes and try to manually repair data. Regards, | ||||||||||
| Comment by Gaurav Kumar [ 23/Sep/21 ] | ||||||||||
|
Hi Dmitry, All of the issues you pointed out were cleared. After that we tried to repair the data and got an invariant failure error. Please note that mongod does start if we change the dbpath to some other directory. That is why we came to the MongoDB team for help. It would be of great help if you could check this further. | ||||||||||
| Comment by Dmitry Agranat [ 22/Sep/21 ] | ||||||||||
|
Hi gaurav.kumar@aitechbay.com, the reported issue does not seem to be related to MongoDB but is rather OS/environment-related. Unless all of these issues are addressed, MongoDB won't be able to start. There are too many issues to list, but I'll point out some examples:
Someone/something is periodically killing the mongod process. The most devastating effect of this is when the mongod process is killed repeatedly during recovery. Example:
It's unclear what configuration was used at the time when we were not able to start because of an unknown storage engine. Example:
Permission denied to the journal dbpath. Example:
Operation not permitted when unlinking the socket file. Example:
Another mongod process is running and we are unable to lock the lock file. Example:
Cannot write to pid file, permission denied. Example:
Read-only directory. Example:
There are some other examples, but the bottom line is that MongoDB is currently failing repeatedly due to configuration/environment issues. Because it went through hundreds of such failed cycles while being repeatedly killed in the middle of recovery, the integrity of the data is unclear at this point.
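The kinds of environment problems listed above (an inaccessible dbpath or journal, an unwritable pid file location, a read-only directory, a leftover lock file) can be checked before mongod is started. The following is only a minimal sketch in Python: the function name `preflight` and its parameters are hypothetical, it is not a MongoDB tool, and it assumes the lock file is named `mongod.lock` inside the dbpath and that a non-empty lock file indicates a running or uncleanly stopped instance.

```python
import os
import tempfile


def preflight(dbpath, pid_file):
    """Return a list of human-readable problems; an empty list means none were found.

    Hypothetical helper for illustration; run it as the same OS user that runs mongod.
    """
    problems = []
    rwx = os.R_OK | os.W_OK | os.X_OK

    # The data directory must exist and be accessible to the mongod user.
    if not os.path.isdir(dbpath):
        problems.append(f"dbpath does not exist: {dbpath}")
    elif not os.access(dbpath, rwx):
        problems.append(f"dbpath is not readable/writable: {dbpath}")
    else:
        # Permission bits can look fine on a read-only mount, so attempt a real write.
        try:
            with tempfile.NamedTemporaryFile(dir=dbpath):
                pass
        except OSError as exc:
            problems.append(f"cannot create files in dbpath: {exc}")

    # The journal subdirectory, if present, needs the same access.
    journal = os.path.join(dbpath, "journal")
    if os.path.isdir(journal) and not os.access(journal, rwx):
        problems.append(f"journal directory is not accessible: {journal}")

    # The directory that will hold the pid file must be writable.
    pid_dir = os.path.dirname(os.path.abspath(pid_file))
    if not os.access(pid_dir, os.W_OK | os.X_OK):
        problems.append(f"cannot write pid file in: {pid_dir}")

    # Assumption: a non-empty mongod.lock means another mongod is running
    # or the previous one did not shut down cleanly.
    lock = os.path.join(dbpath, "mongod.lock")
    if os.path.isfile(lock) and os.path.getsize(lock) > 0:
        problems.append(f"non-empty lock file: {lock}")

    return problems
```

Calling `preflight("/path/to/dbpath", "/path/to/mongod.pid")` with the paths from the actual configuration returns an empty list when none of these specific problems is detected. It does not say anything about the integrity of the data files themselves.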
We do not provide screen-sharing sessions for the SERVER project. With no backups, and since this is a standalone, our options to address this issue are limited (even after all the configuration/environment issues are addressed, which is outside the scope of the SERVER project). | ||||||||||
| Comment by Gaurav Kumar [ 22/Sep/21 ] | ||||||||||
|
Hi Dmitry, We have not created a backup copy because the data is 4.4 TB in size, but we still have the complete data as it is. We can set up a screen-sharing session if you need one. Thanks | ||||||||||
| Comment by Dmitry Agranat [ 22/Sep/21 ] | ||||||||||
|
Hi gaurav.kumar@aitechbay.com, thank you for uploading the requested information. I just want to make sure I understand the steps of the repair procedure that you've mentioned. Did you create a backup copy of the data files in the --dbpath before executing the repair procedure? If not, do you have a recent backup that was taken before the reported issue? | ||||||||||
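For reference, copying the data files before a repair can be sketched as follows. This is only an illustration in Python under stated assumptions: the function name `backup_dbpath` and the `.bak` suffix are hypothetical, mongod must already be stopped, and the target file system needs enough free space for a full copy.

```python
import shutil
from pathlib import Path


def backup_dbpath(dbpath, backup_root):
    """Copy the whole data directory to backup_root/<dbpath name>.bak and return the copy's path.

    Hypothetical helper: stop mongod first, since copying live data files
    can produce an inconsistent copy.
    """
    src = Path(dbpath)
    dst = Path(backup_root) / (src.name + ".bak")
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite existing backup: {dst}")

    # Fail early if the target file system cannot hold the copy.
    needed = sum(f.stat().st_size for f in src.rglob("*") if f.is_file())
    free = shutil.disk_usage(backup_root).free
    if needed > free:
        raise OSError(f"need {needed} bytes, only {free} free under {backup_root}")

    shutil.copytree(src, dst)
    return dst
```

Only after the copy has completed would `mongod --dbpath <dbpath> --repair` be run against the original directory, so that the untouched copy remains available if the repair makes things worse.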
| Comment by Gaurav Kumar [ 20/Sep/21 ] | ||||||||||
|
Hi Dmitry, Thanks, | ||||||||||
| Comment by Gaurav Kumar [ 20/Sep/21 ] | ||||||||||
|
Hi Dmitry Agranat, Any suggestions as to what went wrong? Thanks | ||||||||||
| Comment by Gaurav Kumar [ 19/Sep/21 ] | ||||||||||
|
Thanks, Dima, for picking up this issue. With Regards, | ||||||||||
| Comment by Dmitry Agranat [ 19/Sep/21 ] | ||||||||||
|
Hi gaurav.kumar@aitechbay.com, For a start, we'll need the data covering the original incident, ideally covering some time before the server was restarted and until you have executed the repair command. Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Thanks, | ||||||||||
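The archive requested above could be produced along these lines. This is a minimal sketch in Python, not an official collection tool: the function name `collect_diagnostics` and the `logs/` prefix inside the archive are hypothetical, and it assumes rotated log files share the `mongod.log` prefix.

```python
import tarfile
from pathlib import Path


def collect_diagnostics(log_dir, dbpath, out_file):
    """Bundle mongod.log* files and <dbpath>/diagnostic.data into a gzip-compressed tar archive."""
    with tarfile.open(out_file, "w:gz") as tar:
        # Assumption: rotated logs keep the "mongod.log" prefix, so match on it.
        for log in sorted(Path(log_dir).glob("mongod.log*")):
            tar.add(str(log), arcname=f"logs/{log.name}")

        # diagnostic.data is added recursively when it exists.
        diag = Path(dbpath) / "diagnostic.data"
        if diag.is_dir():
            tar.add(str(diag), arcname="diagnostic.data")
    return out_file
```

Writing `out_file` to a directory other than `log_dir` and `dbpath` keeps the archive from being picked up by its own collection step.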
| Comment by Gaurav Kumar [ 18/Sep/21 ] | ||||||||||
|
mongo -
if (!ready) { else {
    // Initializing with unfinished indexes may occur during rollback or startup.
    auto flags = CreateIndexEntryFlags::kInitFromDisk;
    IndexCatalogEntry* entry = createIndexEntry(opCtx, collection, std::move(descriptor), flags);
    fassert(4505500, !entry->isReady(opCtx, collection));
}
What if an index is corrupted (for example, its build did not complete the second phase before the system was rebooted)? In that case repair should report an error, remove that index from the repair stack, and allow the remaining data to go through the complete process. At least the other data would be safe, and the user would be able to restore the database to a normal state.
| ||||||||||
| Comment by Gaurav Kumar [ 18/Sep/21 ] | ||||||||||
|
Can someone help me with this? |