Core Server / SERVER-88238

checkMetadataConsistency interleaves with collMod during upgrade / downgrade

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • CAR Team 2024-04-01, CAR Team 2024-04-15, CAR Team 2024-04-29, CAR Team 2024-05-13

      The aggregate commands used by checkMetadataConsistency [1] [2] don't set {readConcern: {level: 'snapshot', atClusterTime: <TS>}}. This means the reference metadata that the shard captured earlier may be stale if the metadata is modified before the aggregate command runs.
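
For illustration, a minimal sketch of what attaching a snapshot read concern to an aggregate command document could look like. The helper name `withSnapshotReadConcern`, the placeholder pipeline, and the plain-object timestamp are assumptions made for this example; this is not the code at [1] or [2], and in the shell `clusterTime` would be a real Timestamp.

```javascript
// Hypothetical helper: returns a copy of an aggregate command document that
// reads from a fixed snapshot. `clusterTime` is supplied by the caller.
function withSnapshotReadConcern(aggregateCmd, clusterTime) {
    return Object.assign({}, aggregateCmd, {
        readConcern: {level: "snapshot", atClusterTime: clusterTime},
    });
}

// Placeholder command: the collection name and pipeline are illustrative only,
// not what checkMetadataConsistency actually runs.
const cmd = withSnapshotReadConcern(
    {aggregate: "collections", pipeline: [{$match: {_id: "test.mycoll"}}], cursor: {}},
    {t: 1712000000, i: 1}  // stand-in for a Timestamp value outside the shell
);

console.log(JSON.stringify(cmd.readConcern));
```

If both the earlier metadata capture and the aggregate read at the same `atClusterTime`, a collMod that lands in between would not be visible to either, which is the property the missing read concern would provide.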

      It is therefore possible for a collMod (or another metadata-modifying operation) that acts on a shard directly without taking the DDL lock, as happens during an upgrade / downgrade, to interleave with the checkMetadataConsistency command. The previously captured metadata then doesn't match the new metadata, even though both come from the same shard.

      It's not clear whether the bug here is that checkMetadataConsistency doesn't use a snapshot, or that collMod during upgrade / downgrade doesn't take the DDL lock.

      Reproducer: I issue a collMod directly to the shard so that it interleaves with checkMetadataConsistency, which then complains that shard 0's metadata doesn't match its own metadata:

      // Shard the coll
      mongos> db.adminCommand({
        shardCollection: 'test.mycoll',
        key: {_id: 1}
      })
      
      // On the shard where the collection lives, set a failpoint here:
      // https://github.com/mongodb/mongo/blob/aadd0e171ac7aa8982618db9aad0dab283d7cdeb/src/mongo/db/s/metadata_consistency_util.cpp#L649
      shard-rs0:primary> db.adminCommand({
          configureFailPoint: "pauseBeforeAgg",
          mode: "alwaysOn"
      });
      
      // Try to check metadata consistency - this will hang on the failpoint.
      mongos> db.checkMetadataConsistency();
      
      // Run collMod on the shard directly. This is something that upgrading / downgrading
      // would usually trigger:
      shard-rs0:primary> db.runCommand({collMod: 'mycoll', validator: {a: {$gt: -10}}});
      
      // Turn off the failpoint to let checkMetadataConsistency complete
      shard-rs0:primary> db.adminCommand({
          configureFailPoint: "pauseBeforeAgg",
          mode: "off"
      });
      
      // checkMetadataConsistency now returns an inconsistency:
      {
      	"cursor" : {
      		"id" : NumberLong(0),
      		"ns" : "test.$cmd.aggregate",
      		"firstBatch" : [
      			{
      				"type" : "CollectionOptionsMismatch",
      				"description" : "Collection registered on the sharding catalog not found on the given shards",
      				"details" : {
      					"namespace" : "test.mycoll",
      					"options" : [
      						{
      							"shards" : [
      								"shard-rs0"
      							],
      							"options" : {
      								"uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9"),
      								"validator" : {
      									"a" : {
      										"$gt" : -10
      									}
      								},
      								"validationLevel" : "strict",
      								"validationAction" : "error"
      							}
      						},
      						{
      							"shards" : [
      								"shard-rs0"
      							],
      							"options" : {
      								"uuid" : UUID("095a4222-0ba3-4d22-b295-fbbf010ce6f9")
      							}
      						}
      					]
      				}
      			}
      		]
      	},
      	"ok" : 1,
      	...
      }
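
The telltale sign in the output above is that the same shard appears in two different "options" groups. A small sketch of how that pattern could be detected when triaging results; the helper name `shardsWithConflictingOptions` is hypothetical and the sample document is a trimmed copy of the output above.

```javascript
// Hypothetical helper: given one inconsistency document shaped like the output
// above, return the shard names that appear in more than one "options" group.
function shardsWithConflictingOptions(inconsistency) {
    const seen = new Set();
    const repeated = new Set();
    const groups = (inconsistency.details && inconsistency.details.options) || [];
    for (const group of groups) {
        // De-duplicate within a group so only cross-group repeats are flagged.
        for (const shard of new Set(group.shards || [])) {
            if (seen.has(shard)) {
                repeated.add(shard);
            }
            seen.add(shard);
        }
    }
    return Array.from(repeated);
}

// Trimmed copy of the inconsistency reported in this ticket.
const sample = {
    type: "CollectionOptionsMismatch",
    details: {
        namespace: "test.mycoll",
        options: [
            {shards: ["shard-rs0"], options: {validationLevel: "strict", validationAction: "error"}},
            {shards: ["shard-rs0"], options: {}},
        ],
    },
};

console.log(shardsWithConflictingOptions(sample));
```

A non-empty result suggests the mismatch came from two reads of the same shard at different times rather than from a real divergence between shards.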
      

            Assignee:
            paolo.polato@mongodb.com Paolo Polato
            Reporter:
            vishnu.kaushik@mongodb.com Vishnu Kaushik
            Votes:
            0
            Watchers:
            6
