[SERVER-46845] Shard which received a StaleShardVersion can get stuck indefinitely in a moveChunk command Created: 13/Mar/20  Updated: 29/Oct/23  Resolved: 08/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc2, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: sharding-4.4-stabilization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-03-23, Sharding 2020-04-06, Sharding 2020-04-20
Participants:
Linked BF Score: 15

 Description   

When a shard updates its knowledge of its shard version after a migration commit, it logs a message like this:

[ShardedClusterFixture:job0:shard1:primary] 2020-02-04T15:18:11.560+0000 I  SHARDING [conn55] Updating metadata for collection config.system.sessions from collection version: 15|0||5e398add924cca4d6c4487b2, shard version: 0|0||5e398add924cca4d6c4487b2 to collection version: 16|0||5e398add924cca4d6c4487b2, shard version: 16|0||5e398add924cca4d6c4487b2 due to version change

This log line comes from here, and if we zoom inside CollectionMetadata::toStringBasic(), the call that logs the current shard version invokes ChunkManager::getVersion(ShardId).

If the CatalogCache's entry for the collection gets invalidated for the local shard id while a migration is running concurrently, the completion of the chunk migration can get stuck indefinitely: ChunkManager::getVersion will keep throwing ShardInvalidatedForTargeting exceptions, and the operation will keep getting retried under refreshFilteringMetadataUntilSuccess.
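As a rough illustration of that failure mode (all names below are simplified stand-ins, not the actual server code): the log-formatting helper asks the routing info for the shard's version, the accessor throws because the shard is marked stale, and the surrounding retry helper keeps re-running the same doomed attempt.

#include <stdexcept>
#include <string>

// Simplified stand-in for the ShardInvalidatedForTargeting error.
struct ShardInvalidatedForTargetingException : std::runtime_error {
    ShardInvalidatedForTargetingException()
        : std::runtime_error("ShardInvalidatedForTargeting") {}
};

// Simplified stand-in for ChunkManager: refuses to return a version for a
// shard whose CatalogCache entry has been invalidated.
struct ChunkManagerStub {
    bool shardMarkedStale = true;

    std::string getVersion(const std::string& /*shardId*/) const {
        if (shardMarkedStale)
            throw ShardInvalidatedForTargetingException();
        return "16|0||5e398add924cca4d6c4487b2";
    }
};

// Stand-in for CollectionMetadata::toStringBasic(): building the log line
// asks for this shard's version and can therefore throw.
std::string toStringBasic(const ChunkManagerStub& cm, const std::string& shardId) {
    return "shard version: " + cm.getVersion(shardId);
}

// Stand-in for the retry helper: it re-runs the operation whenever the
// targeting exception escapes. If the exception comes from the logging path
// itself, every attempt fails identically and the migration commit is stuck.
void refreshFilteringMetadataUntilSuccess(const ChunkManagerStub& cm) {
    while (true) {
        try {
            // ... filtering-metadata refresh work elided ...
            std::string logLine = toStringBasic(cm, "shard1");  // throws while the shard stays stale
            (void)logLine;
            return;  // unreachable until the staleness marker is cleared
        } catch (const ShardInvalidatedForTargetingException&) {
            continue;  // retry forever
        }
    }
}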

I think the bug is not currently manifesting because, somehow, after the logging changes were committed this line is no longer logged in the test output. For example here.



 Comments   
Comment by Githook User [ 13/Apr/20 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-46845 Allow logging operations to bypass ShardInvalidatedForTargeting exception when accessing shard versions

(cherry picked from commit 6c88549226e8f68f06b19d9e9a1e0b5f756494b0)
Branch: v4.4
https://github.com/mongodb/mongo/commit/2fdb36056997121e483ce758fc96400ad40bb24e

Comment by Githook User [ 06/Apr/20 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-46845 Allow logging operations to bypass ShardInvalidatedForTargeting exception when accessing shard versions
Branch: master
https://github.com/mongodb/mongo/commit/6c88549226e8f68f06b19d9e9a1e0b5f756494b0

Comment by Randolph Tan [ 01/Apr/20 ]

Approach sounds good to me

Comment by Kaloian Manassiev [ 01/Apr/20 ]

Yes, placing the second catalog cache behind a feature flag, as opposed to testCommandsEnabled, seems like the right way to go (this is what is being discussed under BF-16423 and SERVER-46726).

CC tommaso.tocci

Comment by Blake Oler [ 31/Mar/20 ]

Confirmed my above guess via local testing. The addition of the second catalog cache masks the underlying error, because the second catalog cache will never see shards marked as stale. This makes me nervous about our test coverage. I'd like to be able to place the second catalog cache behind a feature flag so that we can turn it on/off for different testing variants. kaloian.manassiev thoughts?

As for the error itself, it happens because we're attempting to access an outdated RoutingTableHistory object to log its shard version. We should be able to access all versions for logging purposes without throwing a ShardInvalidatedForTargeting exception. In a local patch, I've added secondary getVersionForLogging function calls that skip the staleness check and that we use only when logging. This fixes the issue. Good on the approach? renctan
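A minimal sketch of what such a split accessor could look like, with illustrative names rather than the exact server API:

#include <stdexcept>
#include <string>

class RoutingInfoStub {
public:
    // Targeting path: keeps the staleness check, so routing decisions against
    // an invalidated shard still force a refresh.
    std::string getVersion(const std::string& /*shardId*/) const {
        if (_shardMarkedStale)
            throw std::runtime_error("ShardInvalidatedForTargeting");
        return _version;
    }

    // Logging path: returns the last known version unconditionally, so
    // building a diagnostic string can never throw and re-enter the
    // refresh/retry loop described in this ticket.
    std::string getVersionForLogging(const std::string& /*shardId*/) const {
        return _version;
    }

private:
    bool _shardMarkedStale = true;
    std::string _version = "16|0||5e398add924cca4d6c4487b2";
};

Only the string-building call sites would switch to the logging variant; targeting behavior stays unchanged.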

But I'm unable to test this issue via a JS test unless we turn off the second catalog cache via feature flag. I'll dig to see if I can test this via unit test, but this seems to be a scenario better tested with an integration test.

Comment by Blake Oler [ 13/Mar/20 ]

My most educated guess at the moment is that it wasn't the logging changes that allowed the affected BF to pass again; rather, the introduction of the second catalog cache (which landed around the same time) allowed this code path to succeed. I don't know yet whether that is desired behavior. I will need to spend time thinking about how the second cache interacts with PM-1633. For now, I wouldn't say this is a 4.4-rc0 blocking issue.
