[SERVER-2915] Segfault while creating compound index Created: 07/Apr/11 Updated: 29/Feb/12 Resolved: 21/Jan/12

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Conigliaro | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 0 |
| Labels: | rsc1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 10.10 |
| Operating System: | Linux |
| Participants: | |
| Description |

This just happened on one of the masters on one of my replica sets while I was creating an index:

    Thu Apr 7 17:26:34 Invalid access at address: 0x3255000
    Thu Apr 7 17:26:34 Got signal: 11 (Segmentation fault).
    Thu Apr 7 17:26:34 Backtrace:
    Thu Apr 7 17:26:34 dbexit:
    Thu Apr 7 17:26:37 Got signal: 11 (Segmentation fault).
    Thu Apr 7 17:26:37 Backtrace:
    Thu Apr 7 17:26:37 dbexit: ; exiting immediately
| Comments |
| Comment by Mathias Stearn [ 02/Sep/11 ] |

Do you still have the log file? Also, could you try again on 1.8.3?
| Comment by Michael Conigliaro [ 07/Apr/11 ] |

Definitely on 1.8.1. I upgraded yesterday afternoon, and this crash happened this morning when I tried recreating an index that had previously been created with the keys in the wrong order. I also didn't have journaling enabled until about an hour ago (I thought it was enabled by default for some reason).
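
For reference, journaling is off by default in the 1.8.x series; it has to be enabled explicitly with the --journal option (or journal = true in the config file). Once the server is up, a quick shell check is possible too; a minimal sketch, assuming the 1.8-era serverStatus layout where the dur section only appears when journaling is active:

    // The dur section of serverStatus is only reported when journaling is on
    var status = db.serverStatus();
    print("journaling active: " + (status.dur !== undefined));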
| Comment by Eliot Horowitz (Inactive) [ 07/Apr/11 ] |

Are you sure this was on 1.8.1 and not on 1.8.0?
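
As a quick way to settle that, the running binary reports its own version from the shell; the output in the comment below is illustrative only:

    // In a mongo shell connected to the suspect mongod:
    db.version()    // returns the server's version string, e.g. "1.8.1"

The mongod log also prints "db version" and "git version" lines at startup, which pins down the exact build.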
| Comment by Michael Conigliaro [ 07/Apr/11 ] |

Here's the command I ran:

    db.userEventsJournal.ensureIndex( {"adId":1, "eventId":1, "variationId":1}, {background:true});

Here's the size of the index from a shard where the index was built successfully:

    "adId_1_eventId_1_variationId_1" : 15024842448

The corresponding log line from when the build started:

    Thu Apr 7 16:20:20 [conn6268] building new index on { adId: 1.0, eventId: 1.0, variationId: 1.0 } for socialmedia.userEventsJournal background

This is the last progress line I saw before the crash:

    24787900/77278058 32%

Speed-wise, I would say it was about as consistent as on all the other shards where the task completed successfully. It usually seems to start out pretty fast for the first few percent, but then it slows down considerably. Sometimes it takes up to 5 minutes to move 1 percent, but at that point it's pretty consistent, just slow. I'm not sure if that means anything.
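
For anyone watching a build like this, the same progress counter is visible from a second shell connection while a {background: true} build runs. A minimal sketch, assuming the 1.8-era currentOp layout where the progress string is carried in the msg field:

    // From a second mongo shell: list in-progress operations and print any index builds
    db.currentOp().inprog.forEach(function (op) {
        if (op.msg && /index/.test(op.msg)) {
            // op.msg carries the progress counter, e.g. 24787900/77278058 32%
            print(op.opid + "  " + op.ns + "  " + op.msg);
        }
    });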
| Comment by Gaetan Voyer-Perrault [ 07/Apr/11 ] |

Just re-read the title. Looks like you're building an index in the background. Can you provide any insight on the index you're building?
| Comment by Michael Conigliaro [ 07/Apr/11 ] |

To answer Gaetan Voyer-Perrault's questions on the mailing list:

- Not at the moment.
- Not really, not that I could tell. The connection count had been slowly increasing until the crash; it looks like there were about 3500 connections when the crash occurred (see the connection-count sketch after this comment), but I have other healthy machines with far more connections than that (6000-7000).
- Traffic looked pretty consistent for the hour or so before the crash.
- The logs look pretty normal from what I can tell. This is basically what they look like before the crash:

      Thu Apr 7 17:26:22 [conn8548] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 450ms
      reslen:99 430ms
      reslen:99 450ms
      reslen:99 325ms
      reslen:99 431ms
      reslen:99 431ms
      reslen:99 325ms
      reslen:99 325ms
      reslen:99 404ms
      reslen:99 451ms
      reslen:99 431ms
      reslen:99 431ms
      reslen:1274 2927ms
      reslen:1274 1033ms

- No, the old lockfile was left behind.
- Well, I don't see anything strange in dmesg around the time of the crash. I do see quite a few of these from earlier, though:

      [2719077.488378] INFO: task mongod:16140 blocked for more than 120 seconds.

  I don't think they're related, based on the timestamps: they are almost certainly from when I upgraded to 1.8.1 yesterday. For some reason, upgrading MongoDB causes the whole machine to hang for several minutes and sends the load average to 100+, but I think that's an entirely different issue...
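
For reference on the connection-count answer above (the sketch that item points to): the counts are readable from serverStatus in the mongo shell. A minimal example; the 5000 threshold is an arbitrary illustrative number, not something from this report:

    // Read current vs. available connections on this mongod
    var conns = db.serverStatus().connections;
    print("current: " + conns.current + ", available: " + conns.available);
    if (conns.current > 5000) {   // arbitrary example threshold
        print("connection count is getting high");
    }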