[SERVER-28810] RangeDeleter appears to abort delete due to 112 WriteConflict Created: 14/Apr/17 Updated: 12/Oct/17 Resolved: 16/May/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.13, 3.4.4, 3.5.7 |
| Fix Version/s: | 3.4.5, 3.5.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Reitz | Assignee: | Nathan Myers |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Backport Completed: | |||||
| Backport Requested: | v3.4 | ||||
| Steps To Reproduce: | Don't know how to reproduce, and unfortunately I can't share my log files. We're using version 3.2.11. |
||||
| Sprint: | Sharding 2017-05-29 | ||||
| Participants: | |||||
| Description |
|
We have a sharded cluster. One of our primaries had several queued up RangeDeletes from chunks being moved off due to chunk migration. Typically the log shows the following for deleting a chunk after the migration of the chunk to a new primary:
However, occasionally we see:
I can only assume that the write conflict was not handled properly and the documents were never successfully deleted? |
| Comments |
| Comment by Githook User [ 16/May/17 ] |
|
Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com> Message: |
| Comment by Nathan Myers [ 16/May/17 ] |
|
Fix summary: range deletions that fail on WT optimistic locking are re-tried. |
| Comment by Githook User [ 16/May/17 ] |
|
Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com> Message: |
| Comment by Kaloian Manassiev [ 16/May/17 ] |
|
nathan.myers, write conflict exceptions can be returned by the WT storage engine's optimistic concurrency control. Normally we retry the operation up to a certain number of tries. Have a look at the usages of the MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN/END macros. I am pretty sure we need to use them in the range deleter's loop as well, and that this would solve the problem. |
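For illustration only, the retry pattern described in the comment above can be sketched roughly as follows. This is not the server's MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN/END code; `WriteConflict`, `retryOnWriteConflict` and `FakeRangeDelete` are hypothetical names invented for this sketch, and the attempt limit is an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the error raised when optimistic
// concurrency control detects a conflicting write.
struct WriteConflict : std::runtime_error {
    WriteConflict() : std::runtime_error("WriteConflict") {}
};

// Runs `op` at least once, retrying whenever it throws WriteConflict,
// for at most `maxAttempts` attempts in total.
// Returns the number of attempts used; rethrows the conflict if the
// final attempt also fails.
inline std::size_t retryOnWriteConflict(const std::function<void()>& op,
                                        std::size_t maxAttempts) {
    for (std::size_t attempt = 1;; ++attempt) {
        try {
            op();
            return attempt;
        } catch (const WriteConflict&) {
            if (attempt >= maxAttempts) {
                throw;  // give up and surface the conflict to the caller
            }
            // A real storage layer would discard its snapshot (and
            // possibly back off) here before the next attempt.
        }
    }
}

// Hypothetical deletion step that conflicts a fixed number of times
// before succeeding, used only to exercise the loop above.
struct FakeRangeDelete {
    explicit FakeRangeDelete(std::size_t failures) : failuresLeft(failures) {}

    void run() {
        if (failuresLeft > 0) {
            --failuresLeft;
            throw WriteConflict();
        }
        deleted = true;
    }

    std::size_t failuresLeft;
    bool deleted = false;
};
```

Without such a wrapper, the first conflict escapes the deletion loop and the range is abandoned, which matches the behaviour reported in this ticket.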
| Comment by Nathan Myers [ 16/May/17 ] |
|
If a chunk covering part of the deletion target's range were to be migrated in, that would cause another deletion, but only of the new chunk's range. Persistent range-deletion requests have been talked about for 3.8. Is there anything else that needs to be done before then? |
| Comment by Kaloian Manassiev [ 20/Apr/17 ] |
|
Yes, it definitely looks like the document deletion loop in the range deleter does not catch write conflict exceptions. This function no longer exists in the 3.5 series, but the same problem would be exhibited there as well. |
| Comment by James Reitz [ 18/Apr/17 ] |
|
These were range deletes for chunks that had been migrated off of the shard, not a periodic function. After witnessing these deletes fail with the WriteConflict error, the corresponding chunk's begin/end-range is never mentioned again in the logs on that primary server. So, what makes you think they are being retried if they are not being logged again? As I described in the ticket above, a successful chunk deletion normally gets logged. |
| Comment by Mark Agarunov [ 17/Apr/17 ] |
|
Hello jimreitz, Thank you for the report. From what you've described, this appears to be normal behavior. While the RangeDeleter may hit a conflict and fail during a particular operation, the operation would be retried because the RangeDeleter runs periodically, so the documents will still be deleted. Are you seeing any adverse or unexpected behavior due to this? Thanks, |