[SERVER-17587] Node crash scenario results in unrecoverable error on subsequent startup under WiredTiger Created: 13/Mar/15 Updated: 04/Jun/15 Resolved: 27/Mar/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.1 |
| Fix Version/s: | 3.0.2, 3.1.1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | 28qa | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Description |
|
Issue Status as of Apr 02, 2015 ISSUE SUMMARY USER IMPACT WORKAROUNDS AFFECTED VERSIONS FIX VERSION
Original description
Reproduce as follows. WARNING: this code will crash your machine.
On subsequent startup recovery fails and mongod terminates:
Note that this affects 3.0.1-rc0 and not 3.0.0, so it appears to be a regression. This scenario seems to require the validate(), so I suspect it could be related to the new functionality in |
| Comments |
| Comment by Githook User [ 03/Apr/15 ] | ||||||
|
Author: Michael Cahill (michaelcahill, michael.cahill@wiredtiger.com)
Message: Always go through meta tracking when closing a dirty file, to ensure the metadata is checkpointed. refs | ||||||
| Comment by Githook User [ 03/Apr/15 ] | ||||||
|
Author: Michael Cahill (michaelcahill, michael.cahill@wiredtiger.com)
Message: Add a test of metadata durability across exclusive operations like verify. refs | ||||||
| Comment by Githook User [ 03/Apr/15 ] | ||||||
|
Author: Michael Cahill (michaelcahill, michael.cahill@wiredtiger.com)
Message: Change the sweep server to only operate on clean files. Track the maximum transaction ID seen in the checkpoint of a file so that we can be sure in sweep that all pages can be discarded (without dirtying anything in the tree). Preparation work for | ||||||
| Comment by Michael Cahill (Inactive) [ 27/Mar/15 ] | ||||||
|
Resolved with latest drop from WT. | ||||||
| Comment by Keith Bostic (Inactive) [ 20/Mar/15 ] | ||||||
|
Just for the record, this has turned out to be relatively difficult and involved to fix – michael.cahill has stepped in and is working the problem. | ||||||
| Comment by Keith Bostic (Inactive) [ 16/Mar/15 ] | ||||||
|
What's happening here:
What's happening is that switching from non-exclusive to exclusive handles in steps 3 & 4 forces us to write checkpoints into the collection-2-XXX file: the first time we get some new blocks, and the second time we reuse the original blocks (but the checksum will have changed). After the crash we read the original checkpoint information from the metadata file (there is no log, and the metadata never got updated), find an invalid checksum at that page, and crash. IIRC, the whole checkpoint.resolve work was to make sure there was never an invalid metadata checkpoint entry (at the cost of potentially leaking blocks if we race), but for that to work in this case we have to force the metadata to disk between the two updates. That is, two close & reopen cycles without an intervening metadata write get us into trouble. I'm working on a fix. |
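The failure mode described above can be illustrated with a toy model. This is only a sketch and not WiredTiger code: the `ToyStore` class, its methods, the block addresses, and the use of CRC32 are all hypothetical names and choices made for illustration. It assumes data-file writes reach disk immediately, while the metadata record (checkpoint address plus checksum) is only durable after an explicit flush.

```python
import zlib


class ToyStore:
    """Hypothetical model: one data file of blocks plus a metadata record."""

    def __init__(self):
        self.blocks = {}          # block address -> bytes (assumed on disk at once)
        self.pending_meta = None  # (address, checksum) held in memory only
        self.durable_meta = None  # (address, checksum) last flushed to disk

    def write_checkpoint(self, addr, payload):
        # Writing a checkpoint overwrites whatever lived at that address.
        self.blocks[addr] = payload
        self.pending_meta = (addr, zlib.crc32(payload))

    def flush_metadata(self):
        self.durable_meta = self.pending_meta

    def recover(self):
        # After a crash only durable_meta survives; verify the block it names.
        addr, expected = self.durable_meta
        return zlib.crc32(self.blocks[addr]) == expected


def run(flush_between):
    store = ToyStore()
    store.write_checkpoint(0, b"checkpoint-A")  # original checkpoint
    store.flush_metadata()                      # metadata durably points at it
    store.write_checkpoint(1, b"checkpoint-B")  # first close/reopen: new blocks
    if flush_between:
        store.flush_metadata()                  # the intervening metadata write
    store.write_checkpoint(0, b"checkpoint-C")  # second cycle reuses block 0
    # Crash here: pending_meta is lost, recovery relies on durable_meta.
    return store.recover()


if __name__ == "__main__":
    print("recovery ok without intervening flush:", run(False))
    print("recovery ok with intervening flush:", run(True))
```

In this model, `run(False)` fails recovery because the durable metadata still names block 0 with the old checksum, while `run(True)` succeeds because the durable metadata names block 1, which was not overwritten.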