[SERVER-38219] dataSize, collStats diagnostic commands not detecting change of shards after a sharded collection is dropped Created: 21/Nov/18  Updated: 15/Nov/21  Resolved: 14/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: 章 黒鉄 Assignee: Cheahuychou Mao
Resolution: Duplicate Votes: 1
Labels: ShardingRoughEdges, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.sh     Text File repro_result.txt    
Issue Links:
Duplicate
duplicates SERVER-47436 Make shards validate shardKey in data... Closed
Operating System: ALL
Steps To Reproduce:

Use a cluster with 2 or more shards and 2 mongos nodes that are not being used by any other clients.

  • Create a collection and shard it on field "a". No need to make it particularly big.
    (To reproduce the collStats issue I think the chunks need to all be on one shard, but even if the collection has already split and balanced, the dataSize issue will still be reproducible.)
  • Confirm that dataSize and collStats commands work correctly via both mongos nodes.
  • Using mongos #1, drop the collection. Recreate it and shard on a different field "b". Split and balance it so chunks are on 2 or more shards.
  • On mongos #2, run the dataSize and collStats commands without doing anything else with the collection first.
    • dataSize will fail with a "keyPattern must equal shard key" error
    • collStats will only fetch results from the shard(s) it used originally.
  • Run a single insert command on the collection. The dataSize command (and, I think, also collStats) works correctly immediately after that.
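
The steps above can be sketched as a mongosh session. This is a minimal sketch, not the attached repro.sh: the database/collection names ("test.coll"), the split point, and the shard name passed to moveChunk are all illustrative and must match your cluster.

```javascript
// --- Via mongos #1 ---
// 1. Create the collection and shard it on "a".
sh.enableSharding("test")
sh.shardCollection("test.coll", { a: 1 })
db.coll.insertMany([{ a: 1, b: 1 }, { a: 2, b: 2 }])

// 2. Confirm both commands work (repeat via mongos #2 as well).
db.runCommand({ dataSize: "test.coll", keyPattern: { a: 1 },
                min: { a: MinKey }, max: { a: MaxKey } })
db.coll.stats()

// 3. Drop, recreate, and shard on "b" instead; spread chunks over 2+ shards.
db.coll.drop()
sh.shardCollection("test.coll", { b: 1 })
db.coll.insertMany([{ b: 1 }, { b: 20 }])
sh.splitAt("test.coll", { b: 10 })
sh.moveChunk("test.coll", { b: 10 }, "shard02")   // shard name is illustrative

// --- Via mongos #2, with no other activity on the collection ---
db.runCommand({ dataSize: "test.coll", keyPattern: { b: 1 },
                min: { b: MinKey }, max: { b: MaxKey } })
// expected per this report: fails with "keyPattern must equal shard key",
// because this mongos still has the stale { a: 1 } routing metadata
db.coll.stats()   // only reports the shard(s) it knew about before the drop

// Any write through mongos #2 refreshes its metadata:
db.coll.insertOne({ b: 5 })
```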

 

Sprint: Sharding 2020-04-20
Participants:

 Description   

It seems the dataSize and collStats commands rely on the setShardVersion mechanism, which is triggered by CRUD traffic coming through the same mongos node, to flush the metadata. They return incorrect results until one of those happens, or a flushRouterConfig command is run.

This caught me out recently: a mongos node used exclusively by the DBAs didn't reflect changes made by the app team through the app server's mongos nodes. All the normal queries and updates were being done by those apps, whereas I was only doing diagnosis (no CRUD ops). For half a day I investigated an imaginary data imbalance that the dataSize and collStats commands were showing me, only to find that as soon as I ran flushRouterConfig the issue was gone. I think doing a single CRUD op on the collection in question also resolves it.
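
For reference, the workaround mentioned above can be run against the stale mongos like this (the namespace is illustrative):

```javascript
// Run on the stale mongos: discards its cached routing table so the
// next operation fetches fresh metadata from the config servers.
db.adminCommand({ flushRouterConfig: 1 })

// On versions that support per-namespace flushing, a single collection
// can be targeted instead of the whole cache:
db.adminCommand({ flushRouterConfig: "test.coll" })
```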

The production event preceding the issue was dropping a big sharded collection and recreating it with a new shard key, but presumably mongos nodes will also serve stale results after other metadata changes such as chunk moves.

The issue was encountered in 3.6.7, but as far as I can see the 4.0 code for these commands (and maybe all non-CRUD commands?) uses the catalog cache class in the same way, so I suspect it is still an issue in current release versions too.



 Comments   
Comment by 章 黒鉄 [ 15/Apr/20 ]

Thanks for the update. Sounds resolved. I'm just glad to hear it's fixed on the master branch; getting it backported to v4.0 is a nice extra.

Comment by Cheahuychou Mao [ 14/Apr/20 ]

Closing this ticket since SERVER-47436 was done and backported all the way to v4.0. Unfortunately, the dataSize command in v3.6 does not use shard versioning, so a flushRouterConfig or a CRUD command is required to make a stale mongos refresh.

Comment by Mira Carey [ 13/Feb/20 ]

Adding to a wfbf day to investigate why versioning doesn't cover this case.

Comment by Alexey Menshikov [ 31/Jan/19 ]

I was able to reproduce both issues on 3.6.7 and 4.0.5

 

Comment by Kelsey Schubert [ 21/Nov/18 ]

Thanks for the detailed report, akira!

Generated at Thu Feb 08 04:48:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.