[SERVER-17506] Race between inserts and checkpoints can lose records under WiredTiger
| Created: | 07/Mar/15 | Updated: | 31/Aug/15 | Resolved: | 09/Mar/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.0 |
| Fix Version/s: | 3.0.1, 3.1.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | ET |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Participants: | |
| Description |
|
Repro:
This produces the following errors from the app on the second pass through the loop:
And the following in the mongod log:
|
| Comments |
| Comment by Michael Cahill (Inactive) [ 17/Jun/15 ] |
|
oleg@evergage.com, it's tough to characterize exactly how this bug might have manifested. The bug was triggered if a database page was not in cache at the beginning of a checkpoint and was then read into cache and dirtied (a document was inserted, updated or deleted) during the checkpoint. The checkpoint would (correctly) skip over the dirty page because its changes were too recent to be included, but the tree containing the dirty page could end up being marked clean. If there were then no more updates before mongod was shut down, the dirty page would never get written. Once the bug has occurred (i.e., mongod has been shut down), there is nothing in the data files that could be used to detect or recover from the lost update. As noted here, this was fixed in MongoDB 3.0.1. |
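The sequence described in the comment above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not WiredTiger's code: `Page`, `Tree`, `checkpoint`, `shutdown` and `run` are hypothetical names, the model ignores journaling and concurrency primitives, and the `buggy=False` branch is just one plausible way to avoid the problem in this model rather than a description of the actual patch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    on_disk: List[str]                    # records already in the data file
    in_cache: Optional[List[str]] = None  # None means the page is not in cache
    dirty: bool = False


@dataclass
class Tree:
    pages: List[Page]
    dirty: bool = False

    def insert(self, page_no: int, record: str) -> None:
        """Read the page into cache if needed, then dirty it."""
        page = self.pages[page_no]
        if page.in_cache is None:
            page.in_cache = list(page.on_disk)
        page.in_cache.append(record)
        page.dirty = True
        self.dirty = True


def write_page(page: Page) -> None:
    """Persist the cached contents of a page (hypothetical helper)."""
    page.on_disk = list(page.in_cache)
    page.dirty = False


def checkpoint(tree: Tree,
               concurrent_insert: Optional[Tuple[int, str]] = None,
               buggy: bool = True) -> None:
    # The checkpoint only considers pages that were dirty when it started.
    to_write = [p for p in tree.pages if p.dirty]

    # An insert that arrives while the checkpoint is running.
    if concurrent_insert is not None:
        tree.insert(*concurrent_insert)

    for page in to_write:
        write_page(page)

    if buggy:
        # Toy version of the bug: the tree is marked clean even though
        # a page that the checkpoint skipped is still dirty.
        tree.dirty = False
    else:
        # Assumed remedy for this model only: keep the tree dirty
        # while any page is still dirty.
        tree.dirty = any(p.dirty for p in tree.pages)


def shutdown(tree: Tree) -> List[List[str]]:
    """Assumption: shutdown only flushes trees that are marked dirty."""
    if tree.dirty:
        for page in tree.pages:
            if page.dirty:
                write_page(page)
    return [list(p.on_disk) for p in tree.pages]


def run(buggy: bool) -> List[List[str]]:
    # One page holding record "a" on disk and not yet in cache.
    tree = Tree(pages=[Page(on_disk=["a"])])
    checkpoint(tree, concurrent_insert=(0, "b"), buggy=buggy)
    return shutdown(tree)


if __name__ == "__main__":
    print("buggy:", run(True))
    print("fixed:", run(False))
```

In this model, `run(True)` ends with only record `"a"` on disk, so the insert of `"b"` is silently lost, while `run(False)` ends with both records. Nothing left in `on_disk` reveals that `"b"` ever existed, which mirrors the point that the data files cannot be used to detect the lost update afterwards.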
| Comment by Oleg Rekutin [ 17/Jun/15 ] |
|
michael.cahill, do you have any more information on which parts get corrupted? It looks like this problem corrupted data, but does not show itself until later. Is there any way to perform some kind of validation on data structures to look for this? Is it the indices or actual entry storage? Or is it something related to the replication log? |
| Comment by Michael Cahill (Inactive) [ 08/Mar/15 ] |
|
bruce.lucas@10gen.com, thanks for the repro! asya, thanks for identifying the patch! A fix is here: |