[SERVER-44725] Audit usages of epoch in sharding code Created: 18/Nov/19  Updated: 29/Oct/23  Resolved: 24/Feb/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.3.4

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43690 Make RangeDeleter correctly handle ep... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2020-02-24
Participants:

 Description   

In several places in sharding code we use a change of epoch to indicate that a collection has been dropped and recreated. After the refine shard key project, however, this logic is no longer valid, since refining a shard key also causes an epoch change. We should audit and fix the places where the epoch is used instead of the collection UUID to check for collection drops and recreates.

Specific examples include the MigrationDestinationManager and the CollectionShardingRuntime.
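The pattern under audit, and the direction of the fix, look roughly like this (a sketch with illustrative names, not the actual sharding types):

// Sketch only -- names are illustrative, not the actual sharding types.
// Current pattern: treats any epoch change as a drop/recreate.
if (cachedEpoch != currentMetadata.epoch()) {
    /* ambiguous: the epoch also changes after refineCollectionShardKey */
}

// Intended pattern: the collection UUID changes only on drop/recreate.
if (cachedUuid != currentMetadata.uuid()) {
    /* the collection really was dropped and recreated */
}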



 Comments   
Comment by Jack Mulrow [ 24/Feb/20 ]

I don't think it should be that much work to address either 1) or 2), so I filed SERVER-46370 to track both. I also filed SERVER-46371 to track fixing the log lines.

Comment by Esha Maharishi (Inactive) [ 24/Feb/20 ]

jack.mulrow what you said makes sense to me.

I think it's worth trying to make refineShardKey work correctly when the resumable range deleter is disabled, if that is a relatively small amount of effort. A few questions:

  1. (Bullet 4.1 above) Would it be difficult to fix rangeOverlapsChunk to work correctly if the shard key has been refined?
  2. (Bullet 4.2 above) Would it be difficult to make the MigrationDestinationManager store the collection UUID at the beginning of the receive, and change that comparison to be on UUID?
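
A rough sketch of what 2) could look like, using simplified stand-ins rather than the real MigrationDestinationManager and CollectionMetadata types (member and method names here are hypothetical):

#include <cassert>
#include <string>

// Simplified stand-ins for the real sharding types (illustrative only).
struct CollectionMetadata {
    std::string uuid;   // changes only when the collection is dropped and recreated
    std::string epoch;  // also changes when the shard key is refined
};

struct MigrationDestinationManager {
    std::string receivedUuid;

    // Capture the UUID of the collection incarnation the receive started against.
    void startReceive(const CollectionMetadata& metadata) {
        receivedUuid = metadata.uuid;
    }

    // Compare on UUID instead of epoch, so a refine during the migration does
    // not look like a drop/recreate.
    bool collectionWasDroppedAndRecreated(const CollectionMetadata& current) const {
        return current.uuid != receivedUuid;
    }
};

int main() {
    MigrationDestinationManager mdm;
    mdm.startReceive({"uuid-1", "epoch-1"});

    assert(!mdm.collectionWasDroppedAndRecreated({"uuid-1", "epoch-2"}));  // refine: same collection
    assert(mdm.collectionWasDroppedAndRecreated({"uuid-2", "epoch-3"}));   // drop/recreate
    return 0;
}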

I also think it would be good to fix the log lines that are now incorrect, and to update any comments that you found that are outdated.

Comment by Jack Mulrow [ 14/Feb/20 ]

I grepped the code base for `StaleEpoch`, `.epoch()`, and just `epoch` and here's an overview of what I found from looking through the matches that looked related:

  1. We mostly check epochs to detect when routing information is stale and abort operations on mismatches, sometimes retrying automatically after a refresh. Every place I saw that did this used epochs correctly as far as I could tell, and at worst a mismatch leads to increased latency or a manual retry from a client. Some specific examples:
    1. An epoch mismatch for a CRUD op triggers a mongos and shard refresh and an automatic retry by throwing a stale config exception.
    2. move, split, and merge chunk all check the epoch before committing, aborting without retrying if there's a mismatch. The donor in a move also checks that its epoch matches the epoch sent by the config server, both at the beginning of the move and periodically afterward.
    3. refineCollectionShardKey and clearJumboFlag are forwarded to the config primary with an epoch and will abort without retrying on an epoch mismatch.
    4. Aggregation stages that perform writes ($out and $merge) use a cached epoch to check if the shard key for the output collection has changed since the aggregation was planned. Since the shard key changes after a refine, this seems to work as intended.
  2. There are some miscellaneous places we use epochs that aren't necessarily related to retrying, but don't seem incorrect, e.g.
    1. The map of writes and shard endpoints used in the mongos batch write path considers epochs when differentiating endpoints. It shouldn't be possible for a cached routing table to have both the pre and post refine epoch for a collection, so I don't think this is a problem.
  3. There are a few places where we log or return a misleading error message when we detect an epoch change, typically saying the collection "has been dropped and recreated", which isn't always correct now (e.g. here). We should probably fix these, but I don't think there's much urgency. I also found at least one out-of-date comment.
  4. The only logic bugs I found were around maintaining the list of chunk ranges being received by the MetadataManager; these can probably lead to orphans in rare cases, but I think we already knew this. In particular, I saw two problems:
    1. When setting new filtering metadata in the MetadataManager, we clear entries from the receiving chunks list that overlap with the new metadata. This comparison goes through the ChunkManager, which uses key strings to compare ranges, which doesn't work correctly for ranges with different numbers of fields, so we might not be removing chunks from this list after a refine.
    2. In MigrationDestinationManager::_forgetReceive(), we don't remove a chunk from the receiving range list if the epoch has changed. If there's a refine during the migration, this check is wrong and we may fail to clear the received range from the receiving chunks list (sketched below). There's a similar check in MSM::_notePending() (that Matthew already linked above), but at that point nothing has been added to the receiving chunks list, so aborting shouldn't be a problem.
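
To make 4.2 concrete, here is a stripped-down model of the guard (hypothetical bookkeeping, not the real _forgetReceive code) showing how a refine during the migration leaves the received range behind:

#include <cassert>
#include <set>
#include <string>

int main() {
    // Model of the receiving-chunks bookkeeping: serialized range -> tracked.
    std::set<std::string> receivingChunks;
    std::string range = "[{a: 0}, {a: 10})";
    std::string epochAtReceiveStart = "epoch-1";

    receivingChunks.insert(range);

    // ... a refineCollectionShardKey happens while the migration is running ...
    std::string currentEpoch = "epoch-2";

    // Current guard in the cleanup path: only forget the range if the epoch
    // still matches, on the assumption that a mismatch means drop/recreate.
    if (currentEpoch == epochAtReceiveStart) {
        receivingChunks.erase(range);
    }

    // After a refine the epochs differ but the collection is the same, so the
    // entry is never cleared and can later be treated as an in-flight receive.
    assert(receivingChunks.count(range) == 1);
    return 0;
}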

There are a few things to note about the "bugs" in item 4) that make me question how important it is to fix them:

  1. They seem to require a refine to happen during a migration, which normally won't happen because a refine and move chunk both take a collection distributed lock, so the lock would need to be overtaken or otherwise fail to serialize the operations.
  2. The receiving chunks list only seems to be used when the resumable range deleter is disabled. The only exception is cleanupOrphaned, but I believe SERVER-44160 is intended to fix this already.

So overall, other than edge cases that can leave extra orphans in the old range deletion protocol, every place we use epochs seems to work fine if the epoch changes because of a refine instead of a drop/recreate. We may want to fix up some error messages / logging, but I don't think there's work here that needs to block an RC or 4.4.0.

esha.maharishi, matthew.saltz, what do you guys think?

Comment by Esha Maharishi (Inactive) [ 19/Nov/19 ]

kaloian.manassiev, note that we should not get rid of epochs altogether, because changing the epoch is the way refineShardKey indicates that the shard key has changed.

Comment by Matthew Saltz (Inactive) [ 19/Nov/19 ]

This ticket sprouted out of this comment. It's not about getting rid of epochs (though I also support that); it's about correcting the cases where we currently say

if (epoch != currentEpoch) { 
    /* collection must have been dropped and recreated */ 
}

e.g. here. Fortunately, in these cases we don't care which came before and which came after; if they're different, something is wrong.

Comment by Kaloian Manassiev [ 19/Nov/19 ]

What is the expected outcome of this ticket? Do you want to get rid of epochs altogether (which I support!) or is it about correctness?

Also, UUIDs have the same problem as epochs with respect to drop and create, because you can't differentiate which one happened first.
