  1. Core Server
  2. SERVER-38219

dataSize, collStats diagnostic commands not detecting change of shards after a sharded collection is dropped


      Use a 2+ shard cluster and 2 mongos nodes not being used by any other clients.

      • Create a collection and shard on field "a". No need to make it particularly big.
        (To reproduce the collStats issue, I think all of the chunks need to be on one shard. If the collection has already been split and balanced, the dataSize issue is still reproducible.)
      • Confirm that dataSize and collStats commands work correctly via both mongos nodes.
      • Using mongos #1, drop the collection. Recreate it and shard it on a different field, "b". Split and balance it so that chunks are on 2 or more shards.
      • On mongos #2, run the dataSize and collStats commands without doing anything else with the collection first.
        • dataSize will fail with a "keyPattern must equal shard key" error.
        • collStats will only fetch from the shard(s) that it used originally.
      • Run a single insert command on the collection via mongos #2. The dataSize command, and I think also the collStats command, works correctly immediately after that.
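The steps above might be scripted roughly as follows in the mongo shell. This is only a sketch and is not the attached repro.sh: the helper names `recreateWithNewKey` and `runDiagnostics`, the split point `{ b: 0 }`, and the `toShard` argument are assumptions made for illustration. `mongos1` and `mongos2` stand for connection objects to the two mongos nodes (for example, created with `new Mongo("host:port")`).

```javascript
// Hypothetical helper: on mongos #1, drop the collection, shard it on "b",
// split it at an assumed point, and move one chunk to another shard.
function recreateWithNewKey(mongos1, dbName, collName, toShard) {
  var ns = dbName + "." + collName;
  var admin = mongos1.getDB("admin");
  mongos1.getDB(dbName).getCollection(collName).drop();
  return [
    admin.runCommand({ shardCollection: ns, key: { b: 1 } }),
    admin.runCommand({ split: ns, middle: { b: 0 } }),
    // toShard is assumed to be a shard that does not already hold this chunk.
    admin.runCommand({ moveChunk: ns, find: { b: 0 }, to: toShard })
  ];
}

// Hypothetical helper: on mongos #2, run only the two diagnostic commands,
// with no CRUD operations on the collection beforehand.
function runDiagnostics(mongos2, dbName, collName) {
  var ns = dbName + "." + collName;
  var db = mongos2.getDB(dbName);
  return {
    dataSize: db.runCommand({ dataSize: ns, keyPattern: { b: 1 } }),
    collStats: db.runCommand({ collStats: collName })
  };
}
```

If the behaviour described here is present, the `dataSize` result from `runDiagnostics` would be expected to carry the "keyPattern must equal shard key" error until mongos #2 refreshes its metadata.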

       

    • Sharding 2020-04-20

      It seems the dataSize and collStats commands rely on the setShardVersion mechanism, triggered by CRUD traffic coming through the same mongos node, to refresh the metadata. Their results will be incorrect until such an operation happens or a flushRouterConfig command is run.
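As a workaround sketch, a diagnostic script could flush the router's cached metadata before running dataSize. The helper name `dataSizeAfterFlush` is hypothetical; `conn` is assumed to be a mongo shell connection object to the mongos being used for diagnosis (for example, `db.getMongo()`).

```javascript
// Hypothetical helper: clear this mongos's cached routing metadata,
// then run dataSize against the given namespace.
function dataSizeAfterFlush(conn, ns, keyPattern) {
  var flush = conn.getDB("admin").runCommand({ flushRouterConfig: 1 });
  if (!flush.ok) {
    throw new Error("flushRouterConfig failed: " + JSON.stringify(flush));
  }
  // Database names cannot contain ".", so the first segment is the database.
  var dbName = ns.split(".")[0];
  return conn.getDB(dbName).runCommand({ dataSize: ns, keyPattern: keyPattern });
}
```

For example, `dataSizeAfterFlush(db.getMongo(), "test.foo", { b: 1 })` would issue the flush first and then the dataSize command on the same mongos.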

      This caught me out recently, as a mongos node used exclusively by the DBAs didn't reflect changes made by the app team through the app servers' mongos nodes. All the normal queries and updates were being done by those apps, whereas I was only doing diagnosis (no CRUD ops). For half a day I investigated an imaginary data imbalance issue that the dataSize and collStats commands were showing me, only to find that the issue was gone as soon as I ran flushRouterConfig. I think doing a single CRUD op on the collection in question also resolves it.

      The production event preceding the issue was the dropping of a big sharded collection and recreating it with a new shard key, but presumably mongos nodes will also be stale for other metadata changes such as chunk moves.

      The issue was encountered in 3.6.7, but as far as I can see the 4.0 code for these commands (and maybe all non-CRUD commands?) uses the catalog cache class in the same way, so I suspect it is still an issue in current release versions too.

        1. repro_result.txt
          2 kB
          akira
        2. repro.sh
          3 kB
          akira

            Assignee:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Reporter:
            akira.kurogane@gmail.com 章 黒鉄
            Votes:
            1
            Watchers:
            14

              Created:
              Updated:
              Resolved: