[SERVER-40605] Crashed after mongod started for uncertain period of time Created: 12/Apr/19 Updated: 16/Nov/21 Resolved: 13/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tianlei Mou | Assignee: | Geert Bosch |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | SWNA |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Operating System: | ALL |
| Sprint: | Storage NYC 2019-05-20 |
| Participants: |
| Description |
|
CentOS 7.3, bare metal (no VM, no container). mongod sometimes crashes with signal 11 and sometimes with signal 6. [edit: logs provided in comments] |
| Comments |
| Comment by Geert Bosch [ 13/May/19 ] |
|
Hi Vincent, after careful analysis of the backtraces and logs that you provided, we cannot make progress on finding a root cause. If you encounter further issues, please let us know. If possible, consider using the validate command to check for consistency issues. Note that this will lock the collection while checking, so make sure to run against a node that is not required to serve other database requests. |
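For reference, the consistency check mentioned above could be scripted roughly as follows. This is a minimal sketch and not something taken from the ticket: the helper name `run_validate` is hypothetical, and `db` is assumed to be any object exposing a pymongo-style `command(...)` method (for example a pymongo `Database`). The `full` option requests the more thorough check.

```python
def run_validate(db, collection_name, full=True):
    """Run the MongoDB `validate` command against one collection.

    `db` is assumed to expose a pymongo-style `command(doc)` method.
    Returns a tuple (is_valid, raw_result).
    """
    # The command name must be the first key; dicts keep insertion order.
    command = {"validate": collection_name, "full": full}
    result = db.command(command)
    # The server reports overall consistency in the "valid" field.
    return bool(result.get("valid", False)), result
```

With pymongo this might be called as `run_validate(MongoClient(uri)["mydb"], "mycoll")`, where the URI, database name, and collection name are placeholders. Because validation locks the collection, it should be pointed at a node that is not serving other requests.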
| Comment by Tianlei Mou [ 29/Apr/19 ] |
|
Hi Geert, This 3dis service does many things; here are some that might be related:
We tried to dump this collection, but the dump failed even with --forceTableScan. The error is the same as the one in the log. Under pressure from our customer, we deleted and rebuilt this collection (it is an intermediate collection, built from other collections). The issue does not seem to have happened again. If you still want to investigate this issue, I can provide all the information that I am allowed to share. Thanks a lot for your help. Br, Vincent
|
| Comment by Geert Bosch [ 25/Apr/19 ] |
|
This definitely helps in narrowing down where the errors are. Essentially this means that keys were not in ascending order in the index, which should never happen. I don't know about this 3dis service; can you give more information? |
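To make the invariant concrete: an index stores its keys in sorted order, so no key may compare smaller than its predecessor. The sketch below is purely illustrative (it is not MongoDB or WiredTiger code, and `first_out_of_order` is a hypothetical name). It scans a sequence of byte-string keys and reports the first position where the ordering breaks.

```python
def first_out_of_order(keys):
    """Return the index of the first key smaller than its predecessor, or None."""
    for i in range(1, len(keys)):
        # bytes objects compare lexicographically, byte by byte
        if keys[i] < keys[i - 1]:
            return i
    return None


print(first_out_of_order([b"\x01a", b"\x01b", b"\x02"]))  # None
print(first_out_of_order([b"\x01a", b"\x02", b"\x01b"]))  # 2
```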
| Comment by Tianlei Mou [ 25/Apr/19 ] |
|
Hi Eric, Here are some clues for this ticket:
I do not know whether this is related, but I hope it helps a little.
|
| Comment by Eric Sedor [ 23/Apr/19 ] |
|
Hi sh5dragon5, are you able to provide specific details about the storage subsystem for this machine, and let us know what (if any) errors are showing up in syslog? |
| Comment by Geert Bosch [ 19/Apr/19 ] |
|
I looked at this. Because this is called from decodeRecordIdAtEnd, this is not impacted by any data in the buffer except for the size of the buffer and the final bytes encoding the record id. Because the validity of the RecordId does not depend on the validity of the entire buffer, it is unlikely to be caused by a logic error in KeyString. More likely there has been memory corruption in preserving the length of the buffer, or there has been corruption in the last few bytes of the buffer at a time when it was not protected by a checksum. Such corruption can be caused by either a memory overwrite from unrelated code or a (possibly transient) memory error. If the corruption was introduced before checksum computation, the read triggering the invariant should be repeatable. If the symptoms differ each time, it would appear that the corruption gets introduced each time after successfully reading the data. If it really were flaky memory, however, I'd expect a variety of symptoms, including CRC errors. |
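The timing argument can be illustrated with a toy model. This is only a sketch: it uses Python's `zlib.crc32` as a stand-in for the storage engine's block checksum, and all helper names are hypothetical. Corruption introduced before the checksum is computed is sealed in and passes verification on every read, so the failure is repeatable. Corruption introduced after the checksum is computed is caught by verification.

```python
import zlib


def seal(payload: bytes):
    """Pair a payload with its checksum, as a write to disk would."""
    return payload, zlib.crc32(payload)


def verify(block) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    payload, checksum = block
    return zlib.crc32(payload) == checksum


def flip_last_byte(payload: bytes) -> bytes:
    """Simulate corruption of the final byte of a buffer."""
    return payload[:-1] + bytes([payload[-1] ^ 0xFF])


good = b"index-key-bytes\x2a"

# Case 1: corrupted in memory before the checksum was computed.
sealed_bad = seal(flip_last_byte(good))

# Case 2: corrupted after the checksum was computed.
payload, checksum = seal(good)
bad_after = (flip_last_byte(payload), checksum)

print(verify(sealed_bad))  # True: the checksum covers the bad bytes
print(verify(bad_after))   # False: the mismatch exposes the corruption
```

Neither case covers corruption that happens in memory after a block has already been read and verified; in that situation the checksum offers no protection, which matches the varying symptoms described above.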
| Comment by Tianlei Mou [ 18/Apr/19 ] |
|
Thanks Eric. Here are the follow-ups.
|
| Comment by Eric Sedor [ 17/Apr/19 ] |
|
Thanks sh5dragon5. What storage engine are you using? Can you also please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents of this directory are described here) and attach it to this ticket? |
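One way to produce that archive is sketched below in Python with the standard `tarfile` module. The function name and the example paths are placeholders, not part of the original request; substitute the actual dbpath of the affected node.

```python
import os
import tarfile


def archive_diagnostic_data(dbpath, out_path):
    """Write a gzip-compressed tar of <dbpath>/diagnostic.data to out_path."""
    source = os.path.join(dbpath, "diagnostic.data")
    if not os.path.isdir(source):
        raise FileNotFoundError(source)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps archive paths relative to the diagnostic.data directory
        tar.add(source, arcname="diagnostic.data")
    return out_path
```

For example, `archive_diagnostic_data("/var/lib/mongo", "diagnostic.data.tar.gz")`, where `/var/lib/mongo` is an assumed dbpath.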
| Comment by Tianlei Mou [ 17/Apr/19 ] |
|
Sorry Eric, I missed an important message in the log. It is in the lines above.
As for your first question, I will ask our onsite ops team to check soon. Thanks for your help. |
| Comment by Eric Sedor [ 16/Apr/19 ] |
|
Thanks for the added information so far.
|
| Comment by Tianlei Mou [ 15/Apr/19 ] |
|
| Comment by Tianlei Mou [ 15/Apr/19 ] |
|
| Comment by Eric Sedor [ 12/Apr/19 ] |
|
Thanks for your report. Are you able to provide the logs preceding the stack trace for both the signal 11 and signal 6 cases? |