[SERVER-44725] Audit usages of epoch in sharding code Created: 18/Nov/19 Updated: 29/Oct/23 Resolved: 24/Feb/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Sharding 2020-02-24 | ||||||||
| Participants: | |||||||||
| Description |
|
In several places in sharding we use a change of epoch to indicate that a collection had been dropped and recreated. After the refine shard key project however, this logic is no longer valid, since refining a shard key can cause an epoch change. We should audit and fix the places where epoch is used instead of collection UUID to check for collection drops and recreates. Specific examples include the MigrationDestinationManager and the CollectionShardingRuntime. |
| Comments |
| Comment by Jack Mulrow [ 24/Feb/20 ] | |||
|
I don't think it should be that much work to address either 1) or 2), so I filed | |||
| Comment by Esha Maharishi (Inactive) [ 24/Feb/20 ] | |||
|
jack.mulrow what you said makes sense to me. I think it's worth trying to make refineShardKey work correctly when the resumable range deleter is disabled if it is a relatively small amount of effort. A few questions:
I also think it would be good to fix the log lines that are now incorrect, and to update any comments that you found that are outdated. | |||
| Comment by Jack Mulrow [ 14/Feb/20 ] | |||
|
I grepped the code base for `StaleEpoch`, `.epoch()`, and just `epoch` and here's an overview of what I found from looking through the matches that looked related:
There are a few things to note about the "bugs" in item 4) that make me question how important it is to fix them:
So overall, other than edge cases leaving extra orphans in the old range deletion protocol, everywhere we use epochs seems to work fine if the epoch changes because of a refine instead of a drop/recreate. We may want to fix up some error messages / logging, but I don't think there's work here that needs to block an RC or 4.4.0. esha.maharishi, matthew.saltz, what do you guys think? | |||
| Comment by Esha Maharishi (Inactive) [ 19/Nov/19 ] | |||
|
kaloian.manassiev, note that we should not get rid of epochs altogether, because changing the epoch is the way refineShardKey indicates that the shard key has changed. | |||
| Comment by Matthew Saltz (Inactive) [ 19/Nov/19 ] | |||
|
This ticket sprouted out of this comment. It's not about getting rid of epochs (though I also support that), but it's about correcting cases now where we say
e.g. here. Fortunately in these cases we don't care which is before or which is after, if they're different, something is wrong. | |||
| Comment by Kaloian Manassiev [ 19/Nov/19 ] | |||
|
What is the expected outcome of this ticket? Do you want to get rid of epochs altogether (which I support!) or is it about correctness? Also, UUIDs have the same problem like epochs with respect to drop and create, because you can't differentiate which one happened first. |