[SERVER-22634] Data size change for oplog deletes can overflow 32-bit int
Created: 16/Feb/16 | Updated: 29/Mar/16 | Resolved: 21/Feb/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 3.0.9 |
| Fix Version/s: | 3.0.10 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 1 |
| Labels: | RF |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Description |
|
Issue Status as of Feb 29, 2016

ISSUE SUMMARY
Under write-intensive workloads, the oplog of a replica set can grow past its configured size. When this happens, the system attempts to remove up to 20,000 documents from the oplog to shrink it. If the total size of those 20,000 documents exceeds 2GB, the removal overflows the 32-bit integer that records the size change. The size change is then recorded incorrectly and the oplog still appears to exceed its configured maximum size, so the system attempts to delete more data from the oplog. In extreme cases this can result in the entire contents of the oplog being deleted.

Regular capped collections can be affected by this bug as well, but this is very unlikely.

USER IMPACT
In the unlikely case that a regular capped collection is affected, the system removes data from the capped collection at a faster than normal pace, so the collection may be emptied completely.

WORKAROUNDS

AFFECTED VERSIONS

FIX VERSION

Original description
In wiredtiger_record_store.cpp, _increaseDataSize is declared to take an int for the size change:

But when called from cappedDeleteAsNeeded_inlock, the amount may overflow a 32-bit int if many large records are being deleted, resulting in (very) inaccurate accounting of the size of an oplog. This can cause the oplog deleter thread to delete everything in the oplog while trying to get it back down to the configured maximum size, causing replication to cease. |
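The arithmetic behind the overflow can be illustrated with a minimal sketch. This is not the MongoDB source: `totalFreed`, `narrowed`, `applyWide`, and `applyNarrow` are hypothetical helpers, the 128 KiB record size is an arbitrary example, and the sketch assumes the size change is passed as a negative delta that gets narrowed to 32 bits. The narrowing conversion wraps modulo 2^32 (guaranteed since C++20, implementation-defined before that but the behavior of mainstream compilers).

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; none of these names come from MongoDB.

// Total bytes freed by deleting `docs` records of `bytesEach` bytes, computed in 64 bits.
constexpr int64_t totalFreed(int64_t docs, int64_t bytesEach) {
    return docs * bytesEach;
}

// Models passing a 64-bit delta through a 32-bit int parameter.
// The value wraps modulo 2^32 when it does not fit.
constexpr int32_t narrowed(int64_t delta) {
    return static_cast<int32_t>(delta);
}

// Size accounting with a 64-bit delta: the tracked size shrinks as expected.
constexpr int64_t applyWide(int64_t dataSize, int64_t delta) {
    return dataSize + delta;
}

// Size accounting with a 32-bit delta: a large negative delta has already
// wrapped to a positive value by the time it is added.
constexpr int64_t applyNarrow(int64_t dataSize, int32_t delta) {
    return dataSize + delta;
}
```

With 20,000 records of 128 KiB each, `totalFreed(20000, 131072)` is 2,621,440,000 bytes, which exceeds the 32-bit maximum of 2,147,483,647. Narrowing the delta -2,621,440,000 yields +1,673,527,296, so under these assumptions a tracked size of 3,000,000,000 bytes would grow to 4,673,527,296 through `applyNarrow` instead of shrinking to 378,560,000 through `applyWide`. The collection would then look even further over its limit after the deletion, which is consistent with the runaway deletion described above.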
| Comments |
| Comment by Githook User [ 21/Feb/16 ] |
|
Author: Martin Bligh (martinbligh, mbligh@mongodb.com)
Message: Fix for (cherry picked from commit 2a11d0957b397e2c9bcb4230da9d764b50aaac3b) |
| Comment by Bruce Lucas (Inactive) [ 17/Feb/16 ] |
|
I believe all capped collection deletion goes through that code path, so yes, I think so. |
| Comment by Kevin Pulo [ 16/Feb/16 ] |
|
This affects all high-throughput capped collections, not just the oplog, right? |
| Comment by Bruce Lucas (Inactive) [ 16/Feb/16 ] |
|
I think so, although it wouldn't quite apply cleanly because _amount is already an int, not a bool in 3.0.9. |
| Comment by Adam Midvidy [ 16/Feb/16 ] |
|
would backporting |
| Comment by Bruce Lucas (Inactive) [ 16/Feb/16 ] |
|
This appears to have been corrected in 3.2 and master, but not 3.0. |