[SERVER-36813] Test the defaulting of uniqueKey when the catalog cache is stale Created: 22/Aug/18  Updated: 29/Oct/23  Resolved: 01/Nov/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Sharding
Affects Version/s: None
Fix Version/s: 4.1.5

Type: Task Priority: Major - P3
Reporter: Charlie Swanson Assignee: Charlie Swanson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-35954 Build uniqueKey from the shard key if... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2018-09-10, Query 2018-09-24, Query 2018-10-08, Query 2018-10-22, Query 2018-11-05
Participants:
Linked BF Score: 45

 Description   

We added some testing of what happens when an $out executes with a stale catalog cache, but we'd like to target the functionality introduced in SERVER-35954, which will also read the cache.

This should include the interesting test case of which uniqueKey gets defaulted when a mongos has a stale cache entry indicating that a collection is sharded when it actually is not, or that it is sharded but with a different key than it now has.
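A minimal jstest-style sketch of the first scenario (the cached routing info says the target collection is sharded, but it has since been dropped and recreated as unsharded). The collection names, the extended $out spelling, and the final expectation are illustrative assumptions based on the 4.1 development-series syntax, not the actual test added for this ticket:

{code:javascript}
// Sketch only: exercise a second mongos whose catalog cache is stale about the $out target.
const st = new ShardingTest({shards: 2, mongos: 2});
const dbName = "test";

assert.commandWorked(st.s0.adminCommand({enableSharding: dbName}));
st.ensurePrimaryShard(dbName, st.shard0.shardName);

// Shard the target through mongos0, then force mongos1 to load (and cache) its routing info.
assert.commandWorked(st.s0.adminCommand({shardCollection: dbName + ".target", key: {sk: 1}}));
assert.eq(0, st.s1.getDB(dbName).target.find().itcount());

// Drop and recreate the target as unsharded via mongos0; mongos1's cache is now stale and
// still says the collection is sharded by {sk: 1}.
st.s0.getDB(dbName).target.drop();
assert.commandWorked(st.s0.getDB(dbName).createCollection("target"));

assert.commandWorked(st.s1.getDB(dbName).source.insert({_id: 1, sk: 1}));

// Run $out through the stale mongos1. The defaulted uniqueKey should reflect the collection's
// current (unsharded) state, i.e. {_id: 1}, not the stale shard key {sk: 1, _id: 1}.
st.s1.getDB(dbName).source.aggregate([{$out: {to: "target", mode: "replaceDocuments"}}]);

st.stop();
{code}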



 Comments   
Comment by Githook User [ 01/Nov/18 ]

Author:

{'name': 'Charlie Swanson', 'email': 'charlie.swanson@mongodb.com', 'username': 'cswanson310'}

Message: SERVER-36813 Be careful when choosing default uniqueKey

Before doing so, refresh the catalog cache to make sure the mongos
serving the request is at least somewhat up to date. Additionally,
communicate the epoch used to choose the uniqueKey from mongos to the
shards, and raise an error from the shards and terminate the
aggregation if the epoch of the targeted collection ever changes.
Branch: master
https://github.com/mongodb/mongo/commit/9282c05723eb9f15a6591613007ebe68561c88cb
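For context on the epoch mentioned in the commit message: each sharded collection's routing metadata carries an epoch (an ObjectId stored in config.collections) that changes when the collection is dropped and re-sharded. A minimal shell sketch of that, assuming an existing sharded collection test.target and a mongos connection named mongos (names are illustrative):

{code:javascript}
// Read the target collection's current epoch from the config metadata.
const before = mongos.getDB("config").collections.findOne({_id: "test.target"}).lastmodEpoch;

// Drop and re-shard the collection with a different key; this assigns a new epoch.
mongos.getDB("test").target.drop();
assert.commandWorked(mongos.adminCommand({shardCollection: "test.target", key: {newKey: 1}}));

const after = mongos.getDB("config").collections.findOne({_id: "test.target"}).lastmodEpoch;
assert.neq(before, after);

// Per the commit message above, an in-flight $out that chose its uniqueKey under the old
// epoch should now error on the shards rather than keep writing with a stale uniqueKey.
{code}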

Comment by Charlie Swanson [ 04/Sep/18 ]

kaloian.manassiev: failing the cluster writes will not fully solve the problem. At that point we have already generated a whole batch of writes with the wrong shard key, which may have led to errors like those described in SERVER-36367. It seems like a full solution would either be gross and complicated (it would have to store up errors like that and retry the batch on a stale epoch or changed UUID), or it would involve some sort of up-front round trip to make sure your cache of the shard key is up to date.

Comment by Kaloian Manassiev [ 04/Sep/18 ]

Failing cluster writes on epoch mismatch will solve the problem of the shard key changing out from under a long-running $out aggregation, so the listIndexes call is only an optimization to avoid restarting a long-running aggregation on the first stale version, right?

Otherwise, what kinds of bugs are possible?

Comment by Nicholas Zolnierz [ 04/Sep/18 ]

It's unfortunate, but it sounds like a good plan to me. Per in-person discussion, we may take a latency hit, but it's unlikely to affect users since the $out will likely be bottlenecked on the writes.

Comment by Charlie Swanson [ 04/Sep/18 ]

OK, I've brainstormed a lot and I'll post a brief update here. There is definitely a large potential for bugs here, because we are not using any sort of shard versioning protocol on the $out namespace, only on the source namespace of the aggregation. For example, if your mongos is very stale (either because no one has asked it about the $out collection in a while, or because it was just restarted), the mongos which parses the $out could end up filling in the wrong uniqueKey for the $out stage. Or imagine the aggregation takes a couple of hours to run: if the collection becomes sharded, or is dropped and recreated, during that period, the uniqueKey could be stale.

The user impact of using the wrong default uniqueKey is hard to quantify. Because the default uniqueKey will always include _id, and because you can only end up with duplicate _ids if you managed to insert them on different shards, there will likely still be at most one document matching the uniqueKey for each document output from the $out stage, so the likelihood of a behavioral difference there is pretty low. However, we will enforce certain constraints, as described in SERVER-36367, and those could behave strangely if we're using the wrong uniqueKey.

While investigating this issue, I realized that the targeter used inside the ClusterWriter for all writes on mongos will keep retrying the writes even if the collection has been dropped and re-sharded, or has become sharded recently. This is similar in that it will probably usually do the right thing, but it doesn't feel obviously correct. For example, imagine you are halfway through applying a batch of writes and the collection is dropped and re-created. It's not obvious whether this batch should fail; I don't think it does currently. As I discussed with kaloian.manassiev, we could probably tighten this up a bit by tracking within the ChunkManagerTargeter whether or not the collection's UUID has changed during the course of targeting. If it has, it seems pretty reasonable to abort the operation: at that point the shard key may have changed, and it's hard to say whether the user wanted the operation to keep going.

After speaking more to asya, it seems like the "concurrent change in shard key" problem isn't worth spending much effort solving. Instead, we should focus on the scenario where the mongos is stale. To achieve this, I propose we always issue a listIndexes command to the primary shard for the $out's target database, and that this command should be versioned (attach a shard version to the command, which I checked does result in a StaleConfig exception if the mongos is stale). On a stale shard version error, we should retry the entire aggregate command from the top (a rough sketch of that retry pattern follows this comment). One of nicholas.zolnierz's patches for SERVER-36047 already adds a listIndexes round trip when a user specifies their own uniqueKey, so we aren't so worried about incurring that cost in general.

How's that plan sound to everyone? cc kyle.suarez and david.storch
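A purely illustrative client-side analogue of the retry-from-the-top behavior proposed above. The real retry would live inside mongos (in C++); this only shows the shape of the pattern, and the helper name is hypothetical:

{code:javascript}
// Hypothetical helper: rerun the whole aggregate when a StaleConfig error surfaces,
// mirroring the "retry the entire aggregate command from the top" idea above.
function runAggregateWithRetry(db, collName, pipeline, maxAttempts = 3) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        res = db.runCommand({aggregate: collName, pipeline: pipeline, cursor: {}});
        if (res.ok === 1) {
            return res;  // Success: hand back the cursor response.
        }
        if (res.code !== ErrorCodes.StaleConfig) {
            break;  // Some other error: don't mask it with retries.
        }
        // StaleConfig: fall through and rerun the aggregate from the top.
    }
    return res;
}

// e.g. runAggregateWithRetry(db, "source", [{$out: "target"}]);
{code}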
