[SERVER-28810] RangeDeleter appears to abort delete due to 112 WriteConflict Created: 14/Apr/17  Updated: 12/Oct/17  Resolved: 16/May/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.13, 3.4.4, 3.5.7
Fix Version/s: 3.4.5, 3.5.8

Type: Bug Priority: Major - P3
Reporter: James Reitz Assignee: Nathan Myers
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Backport Requested: v3.4
Steps To Reproduce:

Don't know how to reproduce, and unfortunately I can't share my log files. We're using version 3.2.11.

Sprint: Sharding 2017-05-29
Participants:

 Description   

We have a sharded cluster. One of our primaries had several queued up RangeDeletes from chunks being moved off due to chunk migration. Typically the log shows the following for deleting a chunk after the migration of the chunk to a new primary:

1. Deleter starting delete for: <namespace> from {<begin-range-of-chunk>} -> {<end-range-of-chunk>}, with opId: xxxxxxxx
2. Some time later...Helpers::removeRangeUnlocked time spent waiting for replication: x ms
3. rangeDeleter deleted n documents for <namespace> from {<begin-range-of-chunk>} -> {<end-range-of-chunk>}

However, occasionally we see:

1. Deleter starting delete for: ... (normal log statement as above)
2. some time later... Error encountered while trying to delete range: Error encountered while deleting range: ns<namespace> from {<begin-range-of-chunk>} -> {<end-range-of-chunk>}, cause by:  :: caused by :: 112 WriteConflict
3. No further log statements by the RangeDeleter for the specified chunk range that experienced a write conflict.

I can only assume that the WriteConflict was not handled properly and that the documents were never successfully deleted?



 Comments   
Comment by Githook User [ 16/May/17 ]

Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>

Message: SERVER-28810 Re-try range deletions, backport to 3.4
Branch: v3.4
https://github.com/mongodb/mongo/commit/4351282737916875d039b56cc20b2e6772f2e702

Comment by Nathan Myers [ 16/May/17 ]

Fix summary: range deletions that fail on WT optimistic locking are re-tried.

Comment by Githook User [ 16/May/17 ]

Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>

Message: SERVER-28810 Re-try range deletions
Branch: master
https://github.com/mongodb/mongo/commit/6ad4b7d44832a716a3319e8bfc2aa220b53cb07d

Comment by Kaloian Manassiev [ 16/May/17 ]

nathan.myers, write conflict exceptions can be returned by the WT storage engine's optimistic concurrency control. Normally we retry the operation up to a certain number of tries. Have a look at the usages of the MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN/END macro. I am pretty sure we need to use this in the range deleter's loop as well, and that this would solve the problem.
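
For readers unfamiliar with the pattern described in this comment, the following is a minimal, self-contained sketch of a bounded retry loop around an operation that can fail with a write conflict. It is not the MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN/END macro and not the code from the fix commits. The names WriteConflict, retryOnWriteConflict, and FakeRangeDelete are hypothetical and exist only for illustration.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the storage engine's write-conflict error.
struct WriteConflict : std::runtime_error {
    WriteConflict() : std::runtime_error("WriteConflict") {}
};

// Runs `op` until it succeeds or `maxAttempts` attempts have been made.
// Returns the number of attempts used; rethrows the last WriteConflict
// once the attempt budget is exhausted.
template <typename Op>
int retryOnWriteConflict(Op&& op, int maxAttempts) {
    for (int attempt = 1;; ++attempt) {
        try {
            op();
            return attempt;
        } catch (const WriteConflict&) {
            if (attempt >= maxAttempts) {
                throw;  // give up and surface the conflict to the caller
            }
            // A real server would likely discard its snapshot and back off
            // here before trying again; this sketch retries immediately.
        }
    }
}

// Hypothetical delete operation that conflicts a fixed number of times
// before succeeding, so the retry behavior can be observed.
struct FakeRangeDelete {
    int conflictsRemaining;
    int deleted;

    explicit FakeRangeDelete(int conflicts)
        : conflictsRemaining(conflicts), deleted(0) {}

    void operator()() {
        if (conflictsRemaining > 0) {
            --conflictsRemaining;
            throw WriteConflict();
        }
        ++deleted;  // the delete went through on this attempt
    }
};
```

With this sketch, a FakeRangeDelete constructed with 2 conflicts and passed to retryOnWriteConflict with a budget of 5 succeeds on the third attempt. The behavior reported in the ticket corresponds to having no such loop: the first conflict ends the deletion of that range.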

Comment by Nathan Myers [ 16/May/17 ]

If a chunk covering part of the deletion target's range were to be migrated in, that would cause another deletion, but only of the new chunk's range.

Persistent range-deletion requests have been talked about for 3.8. Is there anything else that needs to be done before then?

Comment by Kaloian Manassiev [ 20/Apr/17 ]

Yes, it definitely looks like the document deletion loop in the range deleter does not catch write conflict exceptions.

This function no longer exists in the 3.5 series, but the same problem would be exhibited there as well.

Comment by James Reitz [ 18/Apr/17 ]

These were range deletes for chunks that had been migrated off of the shard, not a periodic function. After witnessing these deletes fail with the WriteConflict error, the corresponding chunk's begin/end-range is never mentioned again in the logs on that primary server. So, what makes you think they are being retried if they are not being logged again? As I described in the ticket above, a successful chunk deletion normally gets logged.

Comment by Mark Agarunov [ 17/Apr/17 ]

Hello jimreitz,

Thank you for the report. From what you've described, this appears to be normal behavior. While the RangeDeleter may hit a conflict and fail during a particular operation, the operation would be retried because the RangeDeleter runs periodically, and the documents will still be deleted. Are you seeing any adverse or unexpected behavior due to this?

Thanks,
Mark
