[SERVER-47025] moveChunk after refine shard key can hang indefinitely due to missing shard key index Created: 20/Mar/20 Updated: 29/Oct/23 Resolved: 04/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 6.2.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding EMEA |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-2144-Milestone-0 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Sharding EMEA |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Sharding 2020-04-06, Sharding 2020-04-20, Sharding 2020-05-04, Sharding 2020-05-18, Sharding 2020-07-13, Sharding 2020-06-01, Sharding 2020-06-15, Sharding 2020-06-29, Sharding 2020-07-27, Sharding 2020-08-24 |
| Description |
|
When the resumable range deleter is disabled, the recipient of a chunk starts by removing potentially orphaned documents. Only after that does it clone the necessary indexes from the donor. However, the range deleter relies on the shard key index to perform deletions. This can lead to the following scenario:

There may be less convoluted scenarios that could cause this as well, but I'm having trouble thinking of one. Repro attached. |
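The ordering problem described above can be illustrated with a toy model. This is a minimal sketch, not MongoDB code: `ToyRecipient`, `MissingShardKeyIndex`, `receive_chunk`, and the retry cap are all hypothetical names and assumptions made for illustration. The real hang is unbounded; the sketch caps the retries so that it terminates.

```python
class MissingShardKeyIndex(Exception):
    """Raised by the toy range deleter when the shard key index is absent."""


class ToyRecipient:
    """Hypothetical stand-in for a recipient shard; not MongoDB code."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.indexes = set()  # no shard key index has been cloned yet

    def delete_range(self, key, lo, hi):
        # The toy deleter refuses to run without an index on the shard key.
        if key not in self.indexes:
            raise MissingShardKeyIndex(key)
        self.docs = [d for d in self.docs if not (lo <= d[key] < hi)]

    def clone_indexes(self, donor_indexes):
        self.indexes |= set(donor_indexes)


def receive_chunk(recipient, donor_indexes, key, lo, hi, max_attempts=3):
    """Run the steps in the order the description gives:
    delete orphans first, clone indexes afterwards."""
    deleted = False
    for _ in range(max_attempts):  # assumption: the real retry loop is unbounded
        try:
            recipient.delete_range(key, lo, hi)
            deleted = True
            break
        except MissingShardKeyIndex:
            pass  # retrying cannot help: the index is only cloned later
    if not deleted:
        return "stuck"
    recipient.clone_indexes(donor_indexes)
    return "ok"


recipient = ToyRecipient([{"k": 1}, {"k": 5}])
print(receive_chunk(recipient, {"k"}, "k", 0, 3))  # prints "stuck"
```

In this model the deletion step can never succeed, because the only step that would create the index is sequenced after it.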
| Comments |
| Comment by Jordi Serra Torrens [ 04/Jul/23 ] |
|
|
| Comment by Esha Maharishi (Inactive) [ 19/Nov/20 ] |
|
Bringing this back into Needs Scheduling - it had been on my todo list but never ended up getting finished.

I had discussed with Andy that the range deleter should fall back to a collection scan if there is no shard key index, and with Charlie that the range deleter should use a higher-level interface into the query system than deleteWithIndexScan.

I tested making the range deleter use getExecutorDelete with a range query, allowing the query system to choose an index if available, but it didn't work if the shard key was hashed. The issue was that the range to delete is stored in terms of the hashed shard key, and I was trying to create a query with $gte and $lt on those hashed values. So the query was comparing the hashed bounds to the actual (unhashed) field values and returning nonsense.

Andy mentioned the query language should have a $hash operator that applies a hash to the actual values, so that a later pipeline stage can compare two hashed values. A $toHashedIndexKey operator was actually implemented this past summer, see the syntax doc and This may help in falling back to a collection scan, though since pipeline-style removes are not currently supported, it would require using an agg to find the _id values of the documents to delete, then a delete to remove them. |
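The hashed-key mismatch and the two-step workaround from the comment above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_hash` is a hypothetical stand-in and is not MongoDB's hashed index function, and `naive_ids`, `hashed_ids`, and `delete_by_ids` are made-up helpers operating on an in-memory list rather than on a collection.

```python
import hashlib


def toy_hash(value):
    """Hypothetical 64-bit signed hash; NOT MongoDB's hashed index function."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Toy collection whose shard key field is "x".
docs = [{"_id": i, "x": i} for i in range(1000)]

# The range to delete is stored in terms of hashed shard key values.
sorted_hashes = sorted(toy_hash(d["x"]) for d in docs)
lo, hi = sorted_hashes[100], sorted_hashes[200]


def naive_ids(docs, lo, hi):
    # Mirrors the failed attempt: hashed bounds compared against raw values.
    return [d["_id"] for d in docs if lo <= d["x"] < hi]


def hashed_ids(docs, lo, hi):
    # Step 1 of the workaround: scan everything, hash each value, then
    # compare hash to hash and collect the matching _id values.
    return [d["_id"] for d in docs if lo <= toy_hash(d["x"]) < hi]


def delete_by_ids(docs, ids):
    # Step 2 of the workaround: a separate delete keyed on the collected _ids.
    ids = set(ids)
    return [d for d in docs if d["_id"] not in ids]


to_delete = hashed_ids(docs, lo, hi)
remaining = delete_by_ids(docs, to_delete)
```

The scan in `hashed_ids` plays the role of the aggregation that would apply the hash before comparing, and `delete_by_ids` plays the role of the follow-up delete, since the comment notes that a single pipeline-style remove is not available.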
| Comment by Esha Maharishi (Inactive) [ 17/Nov/20 ] |
|
I filed a separate ticket ( |
| Comment by Blake Oler [ 11/Jun/20 ] |
|
Linking BF-17537 to this ticket – a similar scenario lands us in the same infinite loop.
|
| Comment by Esha Maharishi (Inactive) [ 12/May/20 ] |
|
schwerin probably not; I'm moving it to 4.4 Required. |