[SERVER-57515] DB recovery failing Created: 08/Jun/21 Updated: 05/Jul/21 Resolved: 05/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.0 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | t b | Assignee: | Dmitry Agranat |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: |
| Description |
|
Our database crashed and recovery is not working. Recovery keeps cycling through the same iterations and continuously generates files in the journal directory. None of these journal files existed before the crash date.
Here's the relevant part of the log when the crash and recovery started:
In the journal directory, these files are generated continuously and the count is now up to 2036: WiredTigerLog.0000002036
Is this database hosed, or is there any way to recover from this? |
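To put numbers on the journal growth described above, a small script can summarize the journal directory between recovery cycles. This is a minimal sketch, not a MongoDB tool: the function name `journal_summary` is hypothetical, and the pattern (prefix `WiredTigerLog.` followed by a 10-digit sequence) is inferred only from the file name quoted in the description.

```python
import re
from pathlib import Path

# Assumption: journal files look like the one quoted in the ticket,
# i.e. "WiredTigerLog." followed by a 10-digit sequence number.
LOG_NAME = re.compile(r"^WiredTigerLog\.(\d{10})$")


def journal_summary(journal_dir):
    """Return (file_count, highest_sequence, total_bytes) for a journal directory.

    highest_sequence is None when no matching file is found.
    """
    count, highest, total = 0, None, 0
    for entry in Path(journal_dir).iterdir():
        match = LOG_NAME.match(entry.name)
        if match is None or not entry.is_file():
            continue  # skip anything that is not a journal log file
        count += 1
        total += entry.stat().st_size
        seq = int(match.group(1))
        highest = seq if highest is None else max(highest, seq)
    return count, highest, total
```

Running it twice a few minutes apart and comparing the highest sequence number would show whether each recovery cycle is still adding files.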
| Comments |
| Comment by Dmitry Agranat [ 05/Jul/21 ] | ||||||||||||||||||||
|
Glad to hear the same process worked out for you bh3@digitalblur.com. I will go ahead and resolve this case. Regards, | ||||||||||||||||||||
| Comment by t b [ 04/Jul/21 ] | ||||||||||||||||||||
|
Hi Dmitry, I was able to start up the database after removing the journal directory, which left the same set of files I had provided to you. Thanks, Tom | ||||||||||||||||||||
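The workaround reported in this comment (starting mongod without the journal directory) could be sketched as below. Two assumptions are mine rather than the ticket's: the journal is moved to a separate backup location instead of being deleted, and mongod is fully stopped before the move. The function name `set_journal_aside` and the `backup_root` parameter are hypothetical. Writes that only existed in the journal would presumably not be replayed after this step.

```python
import shutil
import time
from pathlib import Path


def set_journal_aside(dbpath, backup_root):
    """Move <dbpath>/journal into backup_root under a timestamped name.

    Assumes mongod is not running. Moving instead of deleting is a
    precaution added in this sketch; the ticket only says the journal
    directory was removed.
    """
    journal = Path(dbpath) / "journal"
    if not journal.is_dir():
        raise FileNotFoundError(f"no journal directory under {dbpath}")

    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    target = backup_root / ("journal." + time.strftime("%Y%m%dT%H%M%S"))
    if target.exists():
        raise FileExistsError(f"backup target already exists: {target}")

    # shutil.move also works when backup_root is on another filesystem.
    shutil.move(str(journal), str(target))
    return target
```

Keeping the backup outside the dbpath avoids leaving unexpected directories next to the data files.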
| Comment by Dmitry Agranat [ 28/Jun/21 ] | ||||||||||||||||||||
|
Hi bh3@digitalblur.com, it looks like the image you tried to attach in your last comment didn't make it. I was able to start up a healthy mongod instance with the data you provided, without any issues. Just to make sure, I executed validate on all collections. The only difference from your deployment is that I used Ubuntu 18.04 without docker-compose. I recommend starting your mongod instance on plain Ubuntu 18.04 without docker-compose and, if everything works fine, working with the team that maintains these docker-compose images to figure out what the issue might be. Dima | ||||||||||||||||||||
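The "validate on all collections" check mentioned in this comment could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used on the ticket: it expects a PyMongo `MongoClient` (or any object with the same three methods), and the function name `validate_all` is hypothetical.

```python
def validate_all(client, full=False):
    """Run validate on every collection and return {"db.collection": bool}.

    `client` is expected to behave like pymongo.MongoClient. A collection
    is reported as False if validate says it is invalid or raises.
    """
    results = {}
    for db_name in client.list_database_names():
        db = client[db_name]
        # Restrict to real collections; views cannot be validated.
        names = db.list_collection_names(filter={"type": "collection"})
        for coll_name in names:
            key = f"{db_name}.{coll_name}"
            try:
                report = db.validate_collection(coll_name, full=full)
                results[key] = bool(report.get("valid", False))
            except Exception:
                # PyMongo raises CollectionInvalid for a failed validation;
                # any other error is also treated as "not validated".
                results[key] = False
    return results


# Usage (assumes pymongo is installed and mongod listens on localhost):
#   from pymongo import MongoClient
#   bad = [k for k, ok in validate_all(MongoClient()).items() if not ok]
```

Passing the client in keeps the helper independent of how the connection is made, for example through a docker-compose port mapping.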
| Comment by t b [ 24/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, For the file upload, correct, I selected everything in the db path directory except for the journal files. I am using the docker version from the docker repository here:
Thanks, Tom | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 24/Jun/21 ] | ||||||||||||||||||||
|
Thanks bh3@digitalblur.com, I just want to clarify regarding the uploaded data. This is a complete set of all files (97 GB) excluding the journal files (85 GB), a total of ~12 GB. Is this correct? Also, I see that the "modules" section is empty:
Could you point me to the link where you have downloaded the binaries for this build? | ||||||||||||||||||||
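The size split asked about in this comment (whole dbpath versus journal versus everything else) can be checked with a short script. A minimal sketch: the function name `dbpath_sizes` is hypothetical, and it assumes the journal lives in a `journal` subdirectory of the dbpath, as in this ticket.

```python
import os


def dbpath_sizes(dbpath, journal_name="journal"):
    """Return (total_bytes, journal_bytes, data_bytes) for a dbpath.

    data_bytes is everything outside the journal subdirectory.
    """
    dbpath = os.path.normpath(dbpath)
    journal_root = os.path.join(dbpath, journal_name)
    total = journal = 0
    for root, _dirs, files in os.walk(dbpath):
        in_journal = root == journal_root or root.startswith(journal_root + os.sep)
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # do not count symlink targets
            size = os.path.getsize(path)
            total += size
            if in_journal:
                journal += size
    return total, journal, total - journal
```

With the figures quoted above, the expected shape of the result is roughly 97 GB total, 85 GB journal, and about 12 GB of data files.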
| Comment by t b [ 22/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, I uploaded the zipped db directory without the journal files. | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 16/Jun/21 ] | ||||||||||||||||||||
|
Hi bh3@digitalblur.com, would it be possible for you to upload a copy of the dbpath for us so we could investigate this further? | ||||||||||||||||||||
| Comment by t b [ 15/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, I am using docker version 4.4.6:
Size of db directory:
| ||||||||||||||||||||
| Comment by Dmitry Agranat [ 14/Jun/21 ] | ||||||||||||||||||||
|
Hi bh3@digitalblur.com, a couple of clarifying questions:
| ||||||||||||||||||||
| Comment by t b [ 14/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, I added the requested files to the same link above. I don't know where the messages log is, though; this computer is a Mac. The original server where the database was running uses Docker and is set to always restart, so it is always in a repair loop. The instance was stopped by running docker-compose down. I then connected the server's hard drive to my Mac OS X computer, copied the db files from the external hard drive to the Mac's hard drive, and brought the MongoDB server up using docker-compose up, at which point the repair process started automatically. | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 13/Jun/21 ] | ||||||||||||||||||||
|
Thanks for the update bh3@digitalblur.com. Could you upload the same set of logs from the new computer as you did in this comment, as well as the archive of the diagnostic.data? Also, could you clarify how you copied the data to a different computer? Specifically, what was the state of the mongod process on the source? Was it still in the repair loop? | ||||||||||||||||||||
| Comment by t b [ 12/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, I've copied the database files to a different computer and a different disk, and I'm seeing the same behavior: the repair process keeps looping with the same error:
| ||||||||||||||||||||
| Comment by Dmitry Agranat [ 11/Jun/21 ] | ||||||||||||||||||||
|
You are correct, there are indeed repeated kernel errors inside kern.log indicating some low-level sdb disk corruption at different sectors:
The syslog also prints partial map device corruption:
Please let us know if after fixing the disk corruption issue you still experience the reported issue. Thanks, | ||||||||||||||||||||
| Comment by t b [ 11/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, This is a standalone instance. I did notice some disk read errors in the kernel log. I'm not sure where the messages log is; can you point me in the right direction? I've uploaded these files: dmesg Thanks, | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 09/Jun/21 ] | ||||||||||||||||||||
|
bh3@digitalblur.com, for completeness, please upload the full mongod log covering the last failed startup, the wiredTiger.wt and wiredTiger.turtle files, and the messages and dmesg logs. I've created a secure portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 09/Jun/21 ] | ||||||||||||||||||||
|
Due to the formatting issue in the original text, I assumed the issue in question was for these messages:
Which was indeed addressed by I have changed the latest log message to show separate lines rather than one long string of all messages concatenated to better see the sequence of events. The relevant message is:
While I am looking into this, could you please answer these questions:
Also, please attach copies of the wiredTiger.wt and wiredTiger.turtle files Thanks, | ||||||||||||||||||||
| Comment by t b [ 08/Jun/21 ] | ||||||||||||||||||||
|
Hi Dmitry, thanks for the info.
I am still seeing this issue after upgrading to 4.4.5: Log after starting up:
And relevant log after going through the recovery cycle:
And another journal file has been created: And the recovery starts another cycle. Any other possibilities for recovery? Thanks | ||||||||||||||||||||
| Comment by Dmitry Agranat [ 08/Jun/21 ] | ||||||||||||||||||||
|
Hi bh3@digitalblur.com, this issue is related to |