[SERVER-33071] Mongodb Crashes with signal: 6 using IXSCAN on MMAPV1 Created: 02/Feb/18 Updated: 21/Mar/18 Resolved: 16/Feb/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin |
| Affects Version/s: | 3.4.10 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Andrey Melnikov | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: |
| Description |
|
Hello, I get the following backtrace in the log when my 3.4.10 master instance (no replica set) running on Ubuntu 16.04.3 LTS crashes. This is possibly related to https://jira.mongodb.org/browse/SERVER-28001. However, `ulimit -n 64000` is set permanently in the settings, and I also ran `mongod --repair` on the database recently when I moved it from another server, but still no luck: the db crashes from time to time. The file system of the volume containing the database files is also OK. I also can't reIndex the collection, because that leads to a crash as well. Could you please advise what I can do to resolve this issue?
|
| Comments |
| Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ] |
|
Hi Andrey, Yes, the remove command would likely encounter the same issue, so there's no way to repair the affected collection in-place, and you would need to insert the recovered documents into a separate collection. I would recommend of course that you do a validate(true) on all of your data to check its integrity. Bruce |
| Comment by Andrey Melnikov [ 02/Feb/18 ] |
|
Hi Bruce, And if I determine which area is corrupted, can I delete those records by _id or by using skip? Will a remove() command targeting this bad area also lead to a crash? Or is it better to insert the unaffected documents into a separate collection during this forward/backward scan? |
| Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ] |
|
Hi Andrey, Unfortunately there isn't. If resync or restore are not possible, you might be able to recover some of the data by doing a forward collection scan with a small batch size (e.g. 2), and then repeating with a backward collection scan. Each will error out when it encounters the first bad document, but by using a small batch size you allow the scan to get as close as possible to the point in the list where the error occurs. Neither scan will be able to reach documents between two separate corrupted regions of the linked list. Hope this helps, |
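The recovery strategy described above (a forward scan and a backward scan, each stopping at the first corrupted record) can be illustrated with a small Python simulation. This is purely illustrative, not MongoDB code; in practice the scans would be natural-order queries with a small batch size (e.g. `find().hint({$natural: 1}).batchSize(2)` and the `$natural: -1` equivalent). Note how documents that sit between two corrupted regions are unreachable from either direction:

```python
def recover(docs, corrupt):
    """Simulate recovery via a forward and a backward scan over a list
    of records. `corrupt` is a set of record indices standing in for
    documents whose on-disk linked-list pointers are damaged; each scan
    stops (errors out) at the first corrupted record it encounters."""
    recovered = {}
    # Forward scan: stops at the first corrupted record.
    for i in range(len(docs)):
        if i in corrupt:
            break
        recovered[i] = docs[i]
    # Backward scan: stops at the last corrupted record.
    for i in range(len(docs) - 1, -1, -1):
        if i in corrupt:
            break
        recovered[i] = docs[i]
    # In practice these would be inserted into a separate collection.
    return [recovered[i] for i in sorted(recovered)]

# Ten documents with records 3 and 7 corrupted: documents 4-6 lie
# between the two corrupted regions and cannot be recovered.
print(recover(list(range(10)), {3, 7}))  # [0, 1, 2, 8, 9]
```

With a single corrupted region, the two scans together recover everything except the bad record itself; with two or more regions, whatever lies between them is lost, which is why resync or restore remains the preferred option.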
| Comment by Andrey Melnikov [ 02/Feb/18 ] |
|
Thanks, Bruce! Is there a way to determine the failing document (or a bunch of them) and then delete them? |
| Comment by Bruce Lucas (Inactive) [ 02/Feb/18 ] |
|
Hi Andrey, That error indicates that while traversing the linked list of records in the mmapv1 data it encountered a pointer to the next record that was less than 8, which is the minimum possible value. We can't provide a definitive diagnosis based on that information alone, but the most likely explanation in our experience is that an error at the storage level occurred and a write failed at some time in the past, leaving a hole in the file that is read back as 0s. You might check syslog for write failures, and perform disk diagnostics. Since mongod repair doesn't work in this case, the recovery options are to resync the node or to restore from backup. Bruce |
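The failure mode described above can be sketched as a toy invariant check. This is not MMAPv1's actual record layout; the field width, function name, and error message are illustrative, and the minimum value of 8 is taken from the comment above. The key point is that a write which never reached disk leaves a hole of zero bytes, so the next-record field reads back as 0 and fails the minimum-offset check during traversal:

```python
MIN_RECORD_OFFSET = 8  # smallest valid next-record value, per the comment above

def next_record_offset(extent, pos):
    """Read a hypothetical 4-byte little-endian 'next record' field at
    `pos` in `extent` and enforce the minimum-offset invariant. A hole
    left by a failed write reads back as zeros, tripping the check."""
    off = int.from_bytes(extent[pos:pos + 4], "little")
    if off < MIN_RECORD_OFFSET:
        raise RuntimeError(
            f"invariant failure: next record offset {off} < {MIN_RECORD_OFFSET}")
    return off

healthy = (0x1000).to_bytes(4, "little")  # a plausible record offset
hole = bytes(4)                           # unwritten region: all zeros
print(next_record_offset(healthy, 0))     # 4096
# next_record_offset(hole, 0) raises RuntimeError, analogous to the
# crash seen when the collection scan reaches the damaged record.
```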