[SERVER-46845] Shard which received a StaleShardVersion can get stuck indefinitely in a moveChunk command Created: 13/Mar/20 Updated: 29/Oct/23 Resolved: 08/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc2, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Blake Oler |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-4.4-stabilization |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Sharding 2020-03-23, Sharding 2020-04-06, Sharding 2020-04-20 |
| Participants: | |
| Linked BF Score: | 15 |
| Description |
|
When a shard updates its knowledge of its shard version after a migration commit, it logs a message which looks like this:
This log line comes from here, and if we zoom inside CollectionMetadata::toStringBasic(), the call to log the current shard version will invoke ChunkManager::getVersion(ShardId). If the CatalogCache's entry for the collection gets invalidated with the local shard id while there is a concurrently running migration, it is possible that the completion of the chunk migration will get stuck indefinitely, because ChunkManager::getVersion will keep throwing ShardInvalidatedForTargetingInfo exceptions and the operation will keep getting retried under refreshFilteringMetadataUntilSuccess.
I think the bug is currently not manifesting because, somehow, after the logging changes were committed this line is no longer logged in the test output. For example, here. |
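To make the failure mode concrete, below is a minimal, self-contained C++ sketch. The class and helper names (ChunkManagerSketch, markStale) are hypothetical stand-ins, not the real server types; it only illustrates, under those assumptions, how a log statement that performs the staleness check can turn a retry-until-success loop into one that never finishes.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Stand-in for the ShardInvalidatedForTargetingInfo exception the ticket describes.
struct ShardInvalidatedForTargeting : std::runtime_error {
    ShardInvalidatedForTargeting() : std::runtime_error("ShardInvalidatedForTargeting") {}
};

// Hypothetical stand-in for ChunkManager: getVersion() throws when the cached
// entry for the requested shard has been marked stale.
class ChunkManagerSketch {
public:
    std::string getVersion(const std::string& shardId) const {
        if (_staleShards.count(shardId))
            throw ShardInvalidatedForTargeting();
        return _versions.at(shardId);
    }

    void markStale(const std::string& shardId) {
        _staleShards.insert(shardId);
    }

private:
    std::map<std::string, std::string> _versions{{"shard0", "5|0"}};
    std::set<std::string> _staleShards;
};

int main() {
    ChunkManagerSketch cm;
    cm.markStale("shard0");  // the entry for the local shard id has been invalidated

    // refreshFilteringMetadataUntilSuccess-style loop: the refresh work itself
    // would succeed, but the log statement inside it throws, so every attempt
    // is retried. Bounded here only so the demo terminates.
    for (int attempt = 1; attempt <= 3; ++attempt) {
        try {
            // ... metadata refresh work would happen here ...
            std::cout << "updated shard version: " << cm.getVersion("shard0") << "\n";
            return 0;  // never reached while the entry stays stale
        } catch (const ShardInvalidatedForTargeting&) {
            std::cout << "attempt " << attempt << ": logging threw, retrying\n";
        }
    }
    std::cout << "still stuck after 3 attempts (unbounded in the real scenario)\n";
}
```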
| Comments |
| Comment by Githook User [ 13/Apr/20 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}
Message: (cherry picked from commit 6c88549226e8f68f06b19d9e9a1e0b5f756494b0) |
| Comment by Githook User [ 06/Apr/20 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}
Message: |
| Comment by Randolph Tan [ 01/Apr/20 ] |
|
Approach sounds good to me |
| Comment by Kaloian Manassiev [ 01/Apr/20 ] |
|
Yes, placing the second catalog cache behind a feature flag, as opposed to testCommandsEnabled, seems like the right way to go (this is what is being discussed under BF-16423 and |
| Comment by Blake Oler [ 31/Mar/20 ] |
|
Confirmed my above guess via local testing. The addition of the second catalog cache obfuscates the underlying error, because the second catalog cache will never see shards marked as stale. This makes me nervous about our test coverage; I'd like to be able to place the second catalog cache behind a feature flag so that we can turn it on and off for different testing variants. kaloian.manassiev, thoughts?
As for the error itself, it occurs because we're attempting to access an outdated RoutingTableHistory object in order to log its shard version. We should be able to read all versions for logging purposes without throwing a ShardInvalidatedForTargeting exception. In a local patch, I've added secondary getVersionForLogging function calls that skip the staleness check and are used only when logging. This fixes the issue. Good on the approach, renctan? However, I'm unable to test this via a JS test unless we turn off the second catalog cache with a feature flag. I'll dig to see whether I can cover it with a unit test, but this seems to be a scenario better tested with an integration test. |
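As a rough illustration of the shape of the fix described in this comment, here is a sketch that extends the hypothetical ChunkManagerSketch from the description above with a logging-only accessor. The names and structure are assumptions for illustration only, not the actual patch: the point is that the log path reads whatever version is cached, stale or not, and so can no longer throw.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Same hypothetical stand-ins as the sketch in the description above.
struct ShardInvalidatedForTargeting : std::runtime_error {
    ShardInvalidatedForTargeting() : std::runtime_error("ShardInvalidatedForTargeting") {}
};

class ChunkManagerSketch {
public:
    // Targeting path: still throws when the entry for the shard is stale.
    std::string getVersion(const std::string& shardId) const {
        if (_staleShards.count(shardId))
            throw ShardInvalidatedForTargeting();
        return _versions.at(shardId);
    }

    // Logging-only path: returns whatever version is cached, stale or not,
    // so a log statement can never trigger the retry loop.
    std::string getVersionForLogging(const std::string& shardId) const {
        auto it = _versions.find(shardId);
        return it != _versions.end() ? it->second : "UNKNOWN";
    }

    void markStale(const std::string& shardId) { _staleShards.insert(shardId); }

private:
    std::map<std::string, std::string> _versions{{"shard0", "5|0"}};
    std::set<std::string> _staleShards;
};

int main() {
    ChunkManagerSketch cm;
    cm.markStale("shard0");

    // The targeting accessor would still throw here, but the log line uses the
    // non-throwing accessor, so the surrounding retry loop can make progress.
    std::cout << "shard version (for logging): " << cm.getVersionForLogging("shard0") << "\n";
}
```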
| Comment by Blake Oler [ 13/Mar/20 ] |
|
The most educated guess I can make at the moment is that it's not the logging changes that altered the behavior and allowed the affected BF to pass again. Rather, it's the second catalog cache (introduced around the same time) that allows this function to pass. I don't know yet whether this is desired behavior; I will need to spend time thinking about how the second cache interacts with PM-1633. For now, I wouldn't say this is a 4.4-rc0 blocking issue. |