[SERVER-37871] $out writing to different database can fail if the shard's routing information is stale Created: 01/Nov/18  Updated: 29/Oct/23  Resolved: 21/Nov/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Sharding
Affects Version/s: None
Fix Version/s: 4.1.6

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Charlie Swanson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2018-11-19, Query 2018-12-03
Participants:

 Description   

As part of SERVER-36813 we added a constraint that the epoch of $out's targeted collection must be agreed to by all participating processes, and must not change during the course of the aggregation. This will unintentionally prevent an $out from executing on a shard which does not know of (have any chunks for) the targeted collection, because there's nothing to ensure that shard's routing table is up to date, and no good logic to refresh that shard's routing table (especially if we want to avoid the deadlock described in SERVER-37398).

For this ticket, we'd like to add logic to do the following:
1) Detect at parse time if the shard executing the $out is more stale than the mongos. This has to happen without refreshing the CatalogCache because doing so needs to take a lock which could induce the deadlock described in SERVER-37398.
2) If we detect the shard is stale, throw a new type of exception. Catch this exception in the shard's service entry point, and (now that we've released all the locks) cause the mongod to refresh it's catalog for the $out's targeted collection. After starting this process, we will return an error to the mongos to signal the mongos should retry the entire command.



 Comments   
Comment by Githook User [ 21/Nov/18 ]

Author:

{'name': 'Charlie Swanson', 'email': 'charlie.swanson@mongodb.com', 'username': 'cswanson310'}

Message: SERVER-37871 Enforce agreement on shard key across cluster for $out
Branch: master
https://github.com/mongodb/mongo/commit/cce280f98a8badf8aef4ed960e82e61e61d3fe5e

Comment by Andy Schwerin [ 07/Nov/18 ]

I’ll definitely want to catch up on this before committing to this approach. Talk next week?

Comment by Charlie Swanson [ 07/Nov/18 ]

nicholas.zolnierz esha.maharishi kaloian.manassiev david.storch@mongodb.com and james.wahlin, here's the state of this ticket:

We've all discussed an approach similar to what I have started in a POC here: https://mongodbcr.appspot.com/236080001/. This would continue with mongos attaching an epoch, and introduce a new sort of 'StaleRouting' exception which can be used to refresh the CatalogCache. $out will then manually check that the epoch it received matches what's present in the shard's catalog cache (without refreshing). We believe this should work and we're all happy with this approach going forward.

While discussing within the query team we thought about an alternative where we instead augment the shard versioning protocol (not the first time this idea has been floated). After agreeing this may make sense now within the query team, I spoke to Esha and we agree it's at least worth exploring. Given this, I'm starting typing on a POC of this. The POC design will be to leave the 'shardVersion' attached to each command as is, but accept and understand an additional argument called 'routingInfo' which is a map from each namespace to the chunk version understood by mongos. For example:

{
  aggregate: "foo",
  pipeline: [{$out: {to: "bar", mode: "replaceDocuments", uniqueKey: {_id: 1}}}],
  cursor: {},
  $db: "test",
  shardVersion: [ Timestamp(2, 0), ObjectId('5bdb768721f0f6e2d33f4dac') ],  // Describes the version of "test.foo".
  routingInfo: {
    "test.bar": [ Timestamp(2, 0), ObjectId('5bdb768721f0f6e2d33f4dad') ]
  }
}

This approach would make $out less "special" in the double-versioning, but we would still want a bit of custom $out code to ensure the version of the target namespace is checked at parse time. Otherwise it would be checked on the first lock for that namespace which would be after we've already executed most of the pipeline and begin to perform writes.

Esha pointed out that a protocol like this alternative proposal may help with future sharding projects such as a "sharded rename" which will need to involve multiple namespaces.

cc schwerin there's a lot of context here but I want to be sure you're kept in the loop. It may make sense to catch up when you're back from abroad.

Generated at Thu Feb 08 04:47:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.