[SERVER-29155] "Random" Segmentation faults Created: 12/May/17 Updated: 29/Jan/18 Resolved: 10/Jun/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Ilja | Assignee: | Keith Bostic (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Sadly it seems pretty sporadic, so I can't provide steps to reproduce. |
| Participants: |
| Description |

Our MongoDB instance crashes sporadically in production; currently it happens every 2-3 weeks, so we can't reliably reproduce the issue. After a database restart the issue is gone for the next week(s). This database is accessed by potentially 5 other dedicated servers; these are its mongostat figures:
| Comments |
| Comment by Keith Bostic (Inactive) [ 10/Jun/17 ] |

That's great news irwks, thanks for letting us know!
| Comment by Ilja [ 10/Jun/17 ] |

Hey Keith, sorry for not getting back to this issue. As suggested, we migrated to another system with completely different hardware, and since this migration we haven't experienced the issue (running for one week now). Our further investigation points to a potential thermal problem at our hosting provider's data center: the first migration was performed onto a hardware-wise identical machine, and in some logs we found messages about the CPUs being throttled to a lower clock rate due to temperature. It seems that the segmentation faults occurred right after these temperature incidents.
| Comment by Keith Bostic (Inactive) [ 09/Jun/17 ] |

Hi irwks, just wanted to make sure we don't lose track of this problem. Are you still investigating on your side?
| Comment by Keith Bostic (Inactive) [ 03/Jun/17 ] |

Hi irwks, I'm sorry to see that you're still having problems. I've been staring at this one today, and obviously it's similar to the last failure you saw. Unfortunately, the cause looks the same: it's a segmentation fault in an extremely heavily used code path (key lookup in the underlying Btree), and it's a unique failure for this MongoDB release; we haven't had any other customer or user experience segmentation faults in this path. The SEGV address is 0x400000153, that is, the low bit is set, and here's the code where it appears we're failing:
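(The pasted snippet is missing from this copy of the ticket. Below is a condensed sketch of the path being described, modeled on WiredTiger's __wt_ref_key in src/include/btree.i from the open-source tree of that era; the WT_* types and macros come from WiredTiger's headers, and the mapping to the cited line numbers is inferred from the surrounding description rather than taken from the original paste.)

/*
 * Sketch of __wt_ref_key: an internal-page key lives in one of two
 * places. Either WT_REF.ref_ikey points to an allocated WT_IKEY
 * structure, or it encodes an on-page offset/length pair directly in
 * the pointer's bits. Allocated memory is at least quad-byte aligned,
 * so a set low-order bit (WT_IK_FLAG) safely marks the encoded form.
 */
#define WT_IK_FLAG                 0x01
#define WT_IK_DECODE_KEY_LEN(v)    ((v) >> 32)
#define WT_IK_DECODE_KEY_OFFSET(v) (((v) & 0xFFFFFFFF) >> 1)

static inline void
__wt_ref_key(WT_PAGE *page, WT_REF *ref, void *keyp, size_t *sizep)
{
    uintptr_t v;

    v = (uintptr_t)ref->ref_ikey;   /* the "line 713" indirection through ref */
    if (v & WT_IK_FLAG) {           /* the "line 714" low-order-bit check */
        /*
         * Encoded form: the key's bytes live on the page itself. Note
         * that 0x400000153 decodes here as key length 4 at page offset
         * 0xa9, i.e. a plausible encoded key with the flag bit set.
         */
        *(void **)keyp = WT_PAGE_REF_OFFSET(page, WT_IK_DECODE_KEY_OFFSET(v));
        *sizep = WT_IK_DECODE_KEY_LEN(v);
    } else {
        /*
         * Pointer form: the "line 720" dereference that faulted when
         * ref->ref_ikey held 0x400000153.
         */
        *(void **)keyp = WT_IKEY_DATA(ref->ref_ikey);
        *sizep = ((WT_IKEY *)ref->ref_ikey)->size;
    }
}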
In this line, *sizep points to an address in the cursor that we're using all the time, so I'm expecting ref or ref->ref_ikey to be the problem. We indirected through ref at line 713, and we checked for a low-order bit set in ref->ref_ikey at line 714, which means we should never arrive at line 720 and attempt to indirect through 0x400000153. I'll have someone else review my analysis to make sure I'm not missing something, but this sure looks like memory corruption to me. It's potentially something else (perhaps a wild pointer from some other part of the code, which might be workload-based), but absent a reproducible test case, we're going to struggle to debug that. Have you seen any more failures since this report? Is there any chance your provider migrated you back to the previous system? Is there anything the two failing systems have in common that might be relevant? It's unlikely a disk subsystem failure would lead to this kind of failure.
| Comment by Ilja [ 27/May/17 ] |

Hello again. As we all know, weekend days are the famous "failure-in-production" days, so sadly I have to update this issue. Our MongoDB instance failed again today on the new machine, after running for eight days without any issues. We uploaded the diagnostic data and the mongod.log to the given portal. Thank you,
| Comment by Kelsey Schubert [ 22/May/17 ] |

Hi irwks, Thanks for confirming that changing the hardware resolved the issue. Please feel free to reference this ticket. Besides memory corruption, another possibility is that the filesystem had a bad block. Thanks again,
| Comment by Ilja [ 21/May/17 ] |

Hello Thomas, thank you very much for your work. We moved our database to another hardware instance, and since then everything has kept running smoothly, even under the heavier load our network is currently experiencing while the one dedicated root server is missing. If this is fine with you, I will refer to this issue when talking to our hosting provider, since the RAM check on the mentioned server did not return any errors. I will also keep this issue updated if anything happens. Thank you!
| Comment by Kelsey Schubert [ 19/May/17 ] |

Hi irwks, We've analyzed the provided stack traces. This behavior appears to be the result of faulty hardware; specifically, memory corruption would most likely explain these segmentation faults. Would you move this mongod instance to a host with new hardware and confirm that the issue is resolved? Thank you,
| Comment by Ilja [ 17/May/17 ] |

Brief update: after running smoothly for the last 3 days, our system is now failing in production right after a restart and running a couple of queries. I just uploaded the current mongod.log file, the diagnostic data, and the matching syslog snippet.
| Comment by Ilja [ 15/May/17 ] |

Alright, @keith.bostic. To rule out any potential OS installation issues, I reinstalled the host completely from scratch and only restored a database backup. If it happens again, I will upload all the needed diagnostic data.
| Comment by Keith Bostic (Inactive) [ 15/May/17 ] |

Thank you, ramon.fernandez. irwks, I haven't given up yet, but so far nothing here is pointing to the problem. Since you're seeing repeated failures, would you upload the same information for your next failure, as you did for this one? Thank you!
| Comment by Ramon Fernandez Marina [ 15/May/17 ] |

Unfortunately the uploaded logs don't have any additional information that can help us here. We'll let you know if there's other information you may collect that could give us a clue; thanks in advance for your continued patience. Regards,
| Comment by Ilja [ 12/May/17 ] |

Of course, @Keith Bostic, I just uploaded the relevant part of the messages.log file via the given upload portal.
| Comment by Keith Bostic (Inactive) [ 12/May/17 ] |

irwks, if more failures occur, would you please get us copies of those logs, too? And could we please have copies of the system diagnostic data (/var/log/messages) covering the time period of this failure and, of course, any future failures? Thank you!
| Comment by Ilja [ 12/May/17 ] |

Hello Thomas, thank you for your fast response. I have uploaded a zip archive of all the requested data. Be aware that the files are pretty big; the interesting part of mongod.log starts at line 1265526. Thanks in advance, Ilja
| Comment by Kelsey Schubert [ 12/May/17 ] |

Hi irwks, Would you please upload an archive of the diagnostic.data directory as well as the complete log files of the affected mongod? I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Thank you,