[SERVER-19230] WT seg fault on pure read work load Created: 30/Jun/15 Updated: 04/Aug/15 Resolved: 06/Jul/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charlie Page | Assignee: | Alexander Gorrod |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
| Comments |
| Comment by Charlie Page [ 03/Jul/15 ] |

OK, thanks for the update, I've removed the data.
| Comment by Alexander Gorrod [ 03/Jul/15 ] |

charlie.page@10gen.com I can't think of any other information we can get from the databases, so feel free to blow them away and restart other workloads. It's frustrating because the crash came out of Snappy, and we don't expect Snappy to crash. We couldn't uncover any corruption in the database, so without a way to replay the crashing read there isn't much more we can discover. We will review the WiredTiger wrapper around Snappy to ensure that we never call it with invalid options. Apart from that, I'm stumped on the root cause.
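The wrapper code itself is not quoted in this ticket; purely as an illustration of the kind of defensive check being described, the sketch below validates a block with the public snappy-c API before decompressing it. The function name, buffer handling, and error convention are assumptions for illustration, not MongoDB or WiredTiger source.

```c
/*
 * Illustrative sketch only: defensive checks before handing a block to
 * Snappy, using the public snappy-c API. Not MongoDB/WiredTiger source.
 */
#include <snappy-c.h>
#include <stdlib.h>

/* Returns 0 on success; -1 if the block is rejected or allocation fails. */
static int
decompress_block(const char *src, size_t src_len, char **dstp, size_t *dst_lenp)
{
    size_t expect_len, out_len;
    char *dst;

    /* Reject blocks whose header does not parse as a Snappy stream. */
    if (snappy_uncompressed_length(src, src_len, &expect_len) != SNAPPY_OK)
        return (-1);

    /* Walk the compressed stream without producing output; this catches
     * many forms of corruption before snappy_uncompress touches memory. */
    if (snappy_validate_compressed_buffer(src, src_len) != SNAPPY_OK)
        return (-1);

    if ((dst = malloc(expect_len)) == NULL)
        return (-1);

    out_len = expect_len;
    if (snappy_uncompress(src, src_len, dst, &out_len) != SNAPPY_OK) {
        free(dst);
        return (-1);
    }

    *dstp = dst;
    *dst_lenp = out_len;
    return (0);
}
```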
| Comment by Alexander Gorrod [ 03/Jul/15 ] |

What we are wondering is whether you can reproduce the crash via mongod. We have verified the content of the underlying database files - a corrupted block in one of those pages would have been the obvious cause for the crash you reported, but we found none. So we are looking for another way to track down the problem. One other thing I noticed is that the database files in that directory are owned by a combination of root and mongod. Is it possible that the problem was a symptom of a file permissions issue?
| Comment by Michael Cahill (Inactive) [ 02/Jul/15 ] |

charlie.page@10gen.com I have completed low-level verification and the only issue I have seen is this one:

That is basically benign and does not explain the segfault. alexander.gorrod is going to follow up, but what I'm looking for now is a way to catch this again so that we can see what's going on. Can you tell from the query that was running when the crash happened how far it had progressed? In other words, could you construct a MongoDB query that repeats the scan starting from near where it failed?
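The ticket does not record what such a resumed query would look like. Purely as an illustration, a scan restarted from the last _id seen before the crash could be driven as in this sketch using the MongoDB C driver (libmongoc); the connection string, database name, collection name, and resume value are placeholders, not values from this ticket.

```c
/*
 * Illustrative sketch only: resume a collection scan from the last _id
 * seen before the crash. All names and values below are placeholders.
 */
#include <mongoc/mongoc.h>
#include <stdio.h>

int
main(void)
{
    mongoc_init();

    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_collection_t *coll =
        mongoc_client_get_collection(client, "test", "readload");

    /* Filter: { _id: { $gt: <last _id reported before the crash> } } */
    bson_t *filter = BCON_NEW("_id", "{", "$gt", BCON_INT64(123456), "}");
    /* Sort ascending on _id so the scan proceeds in the original order. */
    bson_t *opts = BCON_NEW("sort", "{", "_id", BCON_INT32(1), "}");

    mongoc_cursor_t *cursor =
        mongoc_collection_find_with_opts(coll, filter, opts, NULL);

    const bson_t *doc;
    while (mongoc_cursor_next(cursor, &doc))
        ; /* drive the same read path that crashed */

    bson_error_t error;
    if (mongoc_cursor_error(cursor, &error))
        fprintf(stderr, "cursor error: %s\n", error.message);

    mongoc_cursor_destroy(cursor);
    bson_destroy(opts);
    bson_destroy(filter);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
```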
| Comment by Michael Cahill (Inactive) [ 02/Jul/15 ] |

I've run a verify on the failing table overnight and that succeeded without either crashing or producing any errors. I'll try the other collections in case I am missing something, but if there is a page image that causes snappy_uncompress to segfault, I would have expected this to find it.
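For context, this kind of offline table verification can be driven through WiredTiger's public C API roughly as in the sketch below (the wt command-line utility's verify command wraps the same call). This is a sketch only, not the command actually used here: the dbpath and table URI are placeholders, and mongod must be stopped before the dbpath is opened.

```c
/*
 * Illustrative sketch only: verify a single WiredTiger table offline.
 * The dbpath and table URI are placeholders; mongod must not be running.
 */
#include <stdio.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;
    int ret;

    /* Depending on the build, the config string (third argument) may also
     * need an extensions=[...] entry so the snappy compressor can load. */
    if ((ret = wiredtiger_open("/data/db", NULL, NULL, &conn)) != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return 1;
    }
    if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0) {
        fprintf(stderr, "open_session: %s\n", wiredtiger_strerror(ret));
        conn->close(conn, NULL);
        return 1;
    }

    /* Placeholder URI; the real name comes from the collection's ident. */
    ret = session->verify(session, "table:collection-0--123", NULL);
    printf("verify returned %d (%s)\n", ret, wiredtiger_strerror(ret));

    conn->close(conn, NULL);
    return ret == 0 ? 0 : 1;
}
```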
| Comment by Michael Cahill (Inactive) [ 02/Jul/15 ] |

As I said, I've never seen Snappy segfault during uncompress before, and we will need to get to the bottom of that to figure out what is going wrong. To answer your question: a bad read does indeed cause an error message, and the read operation fails rather than bringing down the server.
| Comment by Charlie Page [ 01/Jul/15 ] |

Yes, I'll follow up in an email with the login credentials. Should a bad read bring down the server? It seems that reporting it in the log would be a better alternative.
| Comment by Charlie Page [ 30/Jun/15 ] |

I've attached the xfs_repair output (repair.txt). I believe it's clean, but this is the first time I've had to run it.
| Comment by Ramon Fernandez Marina [ 30/Jun/15 ] |

michael.cahill, can you please look at the stack trace for clues? If we need more information to track this one down, maybe bruce.lucas@10gen.com can work with Charlie to find the root cause of this issue. Thanks,
| Comment by Charlie Page [ 30/Jun/15 ] |

It failed again in 3.0.4, but now it segfaults. I uploaded the last 100k lines of the log. I'll run fsck on the disk tonight, just to be sure, but I doubt it's file system corruption given these are new disks with xfs. I can save the data files for a few days, but I need to remove them to get (hopefully) working ones in the near future. (As a reminder, the data is ~400G compressed.)

conn32
| Comment by Charlie Page [ 30/Jun/15 ] |

ramon.fernandez The full logs are ~65G (currently), too large to upload even compressed. The setup is mongod with WiredTiger (otherwise defaults) on 3.0.3, with a ~450G collection and 30 reader threads. I've upgraded to 3.0.4 and am trying again (60 reader threads). The couple of log lines before the signal 6 (abort) show operations taking ~7000ms.
| Comment by Ramon Fernandez Marina [ 30/Jun/15 ] |

charlie.page@10gen.com, can you please upload the full logs and post the details of your setup?