[SERVER-79058] Mongos collection refreshing issues in a sharded cluster Created: 18/Jul/23  Updated: 15/Sep/23  Resolved: 15/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stephen Paul Adithela Assignee: Antonio Fuschetto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding EMEA
Operating System: ALL
Sprint: Sharding EMEA 2023-08-21, Sharding EMEA 2023-09-04, Sharding EMEA 2023-09-18
Participants:

 Description   

Hello Guys,

We have been facing an issue with one of our sharded MongoDB clusters (Community Edition).

Cluster Mongo Version: 4.4.18

NShards: 3

Data nodes per shard: 4 + 1 arbiter

Config nodes: 5

Individual mongos clients for all the service machines (~18)

 

A read-heavy client is hitting an issue where its mongos queries data from only 1 shard instead of all 3. The queried collection is a sharded collection.

We tried running the flushRouterConfig command before the scripts process that collection, so that the mongos refreshes its cache and recognizes it as a sharded collection, but the fix is intermittent.
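
For reference, a minimal sketch of the refresh commands we run against a mongos; the database and collection names below are placeholders taken from the log entries further down:

// Flush the cached routing table for the whole database:
db.adminCommand({ flushRouterConfig: "dbname" })
// Flush the cached routing table for the specific dated collection:
db.adminCommand({ flushRouterConfig: "dbname.pe_20230716" })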

 

For example:
Running flushRouterConfig at the database level and for that particular collection generates the following log entries:

{"t":{"$date":"2023-07-17T00:10:02.074+00:00"},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-6957","msg":"Refreshed cached collection","attr":{"namespace":"dbname.pe_20230716","newVersion":{"0":{"$timestamp":{"t":1,"i":2}},"1":{"$oid":"64a399837a590637cc1a9994"}},"oldVersion":"","durationMillis":1}}
{"t":{"$date":"2023-07-17T01:30:02.216+00:00"},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-6976","msg":"Refreshed cached collection","attr":{"namespace":"dbname.pe_20230716","newVersion":{"0":{"$timestamp":{"t":1,"i":2}},"1":{"$oid":"64a399837a590637cc1a9994"}},"oldVersion":"","durationMillis":5}}
 

However, after these lines we would expect a further log entry from the ConfigServerCatalogCacheLoader context confirming that this collection was indeed refreshed, and most of the time we don't observe it (it sometimes works). For example:

{"t":{"$date":"2023-07-18T04:18:36.325+00:00"},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-0","msg":"Refreshed cached collection","attr":{"namespace":"dbname.pe_20230730","newVersion":{"0":{"$timestamp":{"t":1,"i":2}},"1":{"$oid":"64b60e8d7a590637cc95899e"}},"oldVersion":"","durationMillis":2}} 

The workaround we resorted to after many issues is to restart mongos just before our critical processes begin, which forces it to refresh and re-cache all collections and databases.

Could you advise on this?

 

Thanks & Regards

 



 Comments   
Comment by Stephen Paul Adithela [ 15/Sep/23 ]

Hi Antonio, I talked to the team and they have no further comments, so we can close the ticket if nothing else is pending. Thanks

Comment by Antonio Fuschetto [ 14/Sep/23 ]

Hi stephenpaul2727@gmail.com,

Please let me know if we can close this ticket or if you need any additional information.

Thanks,
Antonio

Comment by Matt Panton [ 08/Sep/23 ]

Hello stephenpaul2727@gmail.com, I am the PM for antonio.fuschetto@mongodb.com's team, and the specific Jira ticket that fixes this issue is SERVER-32198. Please note that fixes targeted at version 4.7 were released in MongoDB 5.0.

If you have further questions regarding this matter please let me know. Thanks!

Comment by Stephen Paul Adithela [ 06/Sep/23 ]

Thanks antonio.fuschetto@mongodb.com for the analysis. Could you share the bug-fix Jira ticket that fixes this in the 5.x releases (for my own edification)? Maybe it's mentioned in the release notes?

We also have another big MongoDB cluster running 4.4.x with 28 mongos routers. We will be upgrading that cluster to 5.x as well.

 

 

Comment by Antonio Fuschetto [ 06/Sep/23 ]

Hi stephenpaul2727@gmail.com,

The MongoS log from the cloned machine was pretty useful in determining the root cause of the problem.

In version 4.4, deployments with a large set of routers can suffer from stale cached information on sharded collections. Under specific circumstances, a router might assume that a collection is unsharded when in reality it is sharded. As a consequence, the impacted router forwards requests only to the primary shard of the collection until the next refresh.

A possible workaround for this issue is to run the flushRouterConfig command on all routers after running the shardCollection command and before any CRUD operations on the collection.
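
As a minimal sketch of that sequence (the namespace and shard key below are taken from the examples in this ticket and are only illustrative):

// 1. Shard the newly created dated collection (run once, from any mongos):
sh.shardCollection("test.wifiPe_20230719", { lineId: "hashed" })
// 2. On EVERY mongos router, flush the cached routing table for that
//    namespace before any CRUD operation touches it:
db.adminCommand({ flushRouterConfig: "test.wifiPe_20230719" })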

Otherwise, you could upgrade the cluster to MongoDB 5.0 (or later), which implements a more robust routing protocol that is free from this problem. This is the recommended solution, also because you would benefit from a long series of improvements over the version you currently use.

Please let me know if you need any additional information.

Thanks,
Antonio

Comment by Antonio Fuschetto [ 06/Sep/23 ]

Hi stephenpaul2727@gmail.com,

I confirm that we have access to your files. We will update you as soon as we have analyzed this data.

Thanks,
Antonio

Comment by Stephen Paul Adithela [ 05/Sep/23 ]

Sorry for the late reply.

We created a clone of the original reports machine (where the issue happened). On the clone machine, we disabled the mongos restart routine and the flush routine. Today the reports on the clone machine failed (after 14 days) due to the same issue we faced on the original reports machine.

I have uploaded the following pieces of data to the portal now:

clone-sh-stat.out (clone reports machine sh.status())

orig-sh-stat.out (original reports machine sh.status())

clone-wifiPe-0904-find.out (clone reports machine wifiPe_20230904.count())- same as the original

clone-wifiPe-0904-sharddist.out (clone reports machine wifiPe_20230904.getShardDistribution())

orig-wifiPe-0904-sharddist.out (original reports machine wifiPe_20230904.getShardDistribution())

cfg1.tgz - full db path copy tarball of db-cfg1 server (we have total of 5 cfg nodes)

cfg4.tgz - full db path copy tarball of db-cfg4 server (we have total of 5 cfg nodes)

orig-mongos-log.out - mongos log content between 00:00 UTC and 03:00 UTC (in this log, we see mongos restart around 01:23 UTC and also flushRouterConfig around 01:00 UTC)

clone-mongos-log.out - mongos log content between 00:00 UTC and 03:00 UTC (issue happens at 01:30 UTC - no mongos restarts or flushrouterconfig here)

 

Could you ack if you have received the files? Thanks

Comment by Chris Kelly [ 21/Aug/23 ]

Chiming in a bit late - here is a new upload portal link for the additional data requested.

Comment by Antonio Fuschetto [ 17/Aug/23 ]

Hi stephenpaul2727@gmail.com,

It would be very useful for the analysis to have the output of the find({}), getShardDistribution() and sh.status() commands at each of the following points (a minimal sketch of the capture sequence follows the list):

  1. When the problem occurs
  2. After running a flushRouterConfig (to figure out the reasons why it doesn't work)
  3. After a MongoS restart
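
A minimal sketch of the capture sequence, assuming the affected namespace is dbname.pe_20230730 (taken from the logs above) and that the commands are run from the impacted mongos:

// 1. When the problem occurs:
db.getSiblingDB("dbname").pe_20230730.find({})
db.getSiblingDB("dbname").pe_20230730.getShardDistribution()
sh.status()
// 2. Immediately after flushing the router cache for that namespace:
db.adminCommand({ flushRouterConfig: "dbname.pe_20230730" })
db.getSiblingDB("dbname").pe_20230730.getShardDistribution()
sh.status()
// 3. After restarting the mongos process, repeat the same three commands.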

As for the data files of the config server, they are stored in the directory specified by the config server's --dbpath option. The complete backup procedure is described in the official documentation.

chris.kelly@mongodb.com, could you provide a secure upload portal for the data files? Or can Stephen reuse the one you previously provided?

Thanks,
Antonio

Comment by Stephen Paul Adithela [ 16/Aug/23 ]

Hi Antonio, Could you please provide a secure link for the artifacts?

Regarding #1: right now we are not using the flushRouterConfig command to refresh the mongos cluster metadata, as it doesn't seem to work. Instead, we are restarting the whole mongos process on the machine to avoid the issue. This has worked well so far with no issues. A mongos restart triggers a full refresh of all cached collections, as in the log below:

{"t":{...},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-0","msg":"Refreshed cached collection","attr":{"namespace":"sky.wifiCollection_20230816","newVersion":{"0":{"$timestamp":{"t":1,"i":2}},"1":{"$oid":"64cc78077a590637cc7e3be5"}},"oldVersion":"","durationMillis":0}}

I can also share getShardDistribution and sh.status() before the mongos restart and after if that is okay with you 

Regarding #2, you mean the diagnostic.data of the config nodes, right?

Comment by Antonio Fuschetto [ 16/Aug/23 ]

Hi stephenpaul2727@gmail.com,

The FTDC data files are not particularly useful in this scenario, so more information is needed to determine the nature of the problem. Specifically:

  1. Given a namespace where you are experiencing the problem, with more than one document stored, please provide the output of the getShardDistribution and sh.status() commands before and after forcing a router cache refresh. Before that, please make sure to insert at least one document on each of the available shards (based on the configured shard key). Please report the full console output, including both the issued commands and their output.
  2. Additionally, upload the data files of the config server (these files store only sharded cluster metadata, so no business/sensitive information is included).

Thanks,
Antonio 

Comment by Stephen Paul Adithela [ 10/Aug/23 ]

Thanks Chris, I see this is now assigned to the Sharding EMEA team. It seems like it was internally understood that some bug might be affecting specific scenarios like ours. Would your team be okay with sharing what you are observing so far from the shared diagnostics and what would be fixed? Would this fix perhaps be part of the upcoming 4.4.24 patch release? Thanks again for looking into this.

Comment by Chris Kelly [ 02/Aug/23 ]

Hi stephenpaul2727@gmail.com,

Thanks for your patience. I just wanted to let you know we received your .tar and are looking into this. 

Comment by Stephen Paul Adithela [ 26/Jul/23 ]

Hi Team, Any updates from your end on this case?

Comment by Stephen Paul Adithela [ 21/Jul/23 ]

I have uploaded the data now. Thanks!!

Comment by Chris Kelly [ 20/Jul/23 ]

Sorry, it doesn't look like we've gotten any files in that bucket. I've made a new link for you here - you can try that one and see if it works better. Alternatively, that data can be uploaded to this ticket if you like.

Comment by Stephen Paul Adithela [ 19/Jul/23 ]

As an aside, it looks like the balancer is disabled on wifiPe_20230719 because of 'noBalance': true. This means that MongoDB won't automatically move chunks between shards to balance the distribution of data in this collection. This also would seem to imply you are relying on the hashed shard key to balance across single jumbo chunks instead of relying on chunk migration from the mongos' balancer.

Is there a compelling reason you chose to have single chunks and the balancer disabled? This is not a typical recommended setup. These huge jumbo chunks can become a bottleneck, especially if the shard key value occurs with high frequency. The mongos logs would definitely be interesting around the time period you are referencing.

Regarding this, yes, we have been using this approach for a while because the chunk splitter (autosplit) creates too many chunks (< 64 MiB per chunk, as designed), and this causes the load on the whole sharded cluster to go significantly higher. Autosplit is also disabled in our cluster.

These dated collections are set to expire and be dropped within a week to a month, so we don't want the additional overhead of rebalancing these huge collections affecting our cluster's performance. There are other long-term collections, and we do let the balancer take care of those.
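
For context, a hedged sketch of how this kind of setup can be configured from a mongos using the standard shell helpers (the collection name is one of the dated namespaces mentioned in this ticket):

// Keep the balancer away from a specific dated collection (sets noBalance: true):
sh.disableBalancing("test.wifiPe_20230719")
// Disable auto-splitting for the whole sharded cluster:
sh.disableAutoSplit()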

Once we get the diagnostic data, I am going to assign this ticket to the relevant team to gauge whether there is any weird edge-case behavior we could expect around this sort of deployment, whether there are any follow-up questions, or whether we want to continue investigating this as a possible bug.

I have already uploaded the diagnostic data to the portal. Could you confirm from your end?

Thanks!!

Comment by Chris Kelly [ 19/Jul/23 ]

stephenpaul2727@gmail.com, can you try to re-upload to that link? I didn't see any files there. Alternatively you can upload them to the ticket directly if you like.

  • I don't have enough compelling evidence to implicate the mongos in anything here yet. nShards can change based on the nature of the query, where the data exists, or whether the shard key was used in a query (and a scatter-gather was performed across all shards). In itself, and without logs, this appears to be expected.
  • If you're seeing nShards: 1 more often than usual, this would be expected to be due to:
    • changes in your query patterns (accessing data in the chunk on node 1 more often)
    • changes in the data
  • Your shard key is hashed:
    • This means that each shard should receive approximately the same number of writes, assuming the shard key has a high cardinality (many unique values). This appears to be true from your distribution across shards.
    • If your queries often need to retrieve documents based on the shard key, and the query condition is an equality condition or an exact match (e.g., find({shardKey: value})), then the mongos router can target the query directly to the shard that holds the matching document(s), similar to writes (see the sketch after this list).
      • If that data happens to be on shard 1, that would explain an increase in writes to shard 1. However, this seems unlikely if you are distributing evenly across each shard (which appears to be true).
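
To illustrate the two routing modes with a hypothetical query (lineId is the hashed shard key from this ticket; the values and the non-key field are made up):

// Equality match on the hashed shard key: mongos can target a single shard.
db.wifiPe_20230719.find({ lineId: "line-12345" })
// No shard-key predicate: mongos scatter-gathers across all shards that own
// chunks for the collection (the slow-query log would show nShards: 3).
db.wifiPe_20230719.find({ status: "active" })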

As an aside, it looks like the balancer is disabled on wifiPe_20230719 because of 'noBalance': true. This means that MongoDB won't automatically move chunks between shards to balance the distribution of data in this collection. This also would seem to imply you are relying on the hashed shard key to balance across single jumbo chunks instead of relying on chunk migration from the mongos' balancer.

Is there a compelling reason you chose to have single chunks and the balancer disabled? This is not a typical recommended setup. These huge jumbo chunks can become a bottleneck, especially if the shard key value occurs with high frequency. The mongos logs would definitely be interesting around the time period you are referencing.

Once we get the diagnostic data, I am going to assign this ticket to the relevant team to gauge whether there is any weird edge-case behavior we could expect around this sort of deployment, whether there are any follow-up questions, or whether we want to continue investigating this as a possible bug.

Christopher

 

Comment by Stephen Paul Adithela [ 19/Jul/23 ]

It sounds like flushRouterConfig is a bit of a distraction here. Does it actually help with this behavior normally? Has its behavior suddenly changed? If so, pointing out specific timestamps when this occurs would also be helpful.

Regarding this, we have slow queries logged by mongos, and we noticed that queries against the wifiPe_2023xxxx dated collections are logged with nShards: 1 (instead of 3); this is how we noticed that the mongos is only querying 1 shard instead of 3. When flushRouterConfig is run, sometimes it does refresh the collection cache, and if we run our script afterwards, nShards is 3. Sometimes, even after running that command, the mongos cache for that collection doesn't refresh. A mongos restart solves the problem in that case for now. When mongos starts, it re-caches the config state of every database and collection, I believe.
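
As a hedged spot check (the collection name is a placeholder), the mongos can also be asked how it would route a query without waiting for a slow-query log entry; for a sharded find, the explain output lists the shards the router would target:

// The shards listed under the winning plan indicate how many shards the
// mongos would target; a stale routing table would show only one shard.
db.wifiPe_20230904.find({}).explain("queryPlanner")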

Comment by Stephen Paul Adithela [ 19/Jul/23 ]

Thanks for your reply chris.kelly@mongodb.com. I can confirm that the shard distribution of the collection is not the cause.

We have a specific dated collection that is created by one of our applications every day, and we ensure it is created with a chunk count equal to the shard count; later, when writing, all data is distributed equally across all 3 shards. Here is a snippet of some commands:

For example, wifiPe_20230719 is the dated collection created today, with 3 chunks on the 3 shards respectively, which will be populated later; wifiPe_20230718 is the dated collection created yesterday, which is already fully populated (a sketch of how such a collection can be created with pre-split chunks appears after the shell transcript below).

mongos> use test
switched to db test
mongos> use config
switched to db config
mongos> db.collections.findOne({_id: "test.wifiPe_20230719"})
{
        "_id" : "test.wifiPe_20230719",
        "lastmodEpoch" : ObjectId("64a78e057a590637ccd200c3"),
        "lastmod" : ISODate("1970-02-19T17:02:47.298Z"),
        "dropped" : false,
        "key" : {
                "lineId" : "hashed"
        },
        "unique" : false,
        "uuid" : UUID("997789ed-6736-4079-b6f0-5feee3ecabcd"),
        "distributionMode" : "sharded",
        "noBalance" : true
}
mongos> use test
switched to db test
mongos> db.wifiPe_20230719.getShardDistribution()

Shard rs3 at rs3/test-db-03a:27018,test-db-03b:27018,test-db-03c:27018,test-db-03d:27018
 data : 0B docs : 0 chunks : 1
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Shard rs2 at rs2/test-db-02a:27018,test-db-02b:27018,test-db-02c:27018,test-db-02d:27018
 data : 0B docs : 0 chunks : 1
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Shard rs1 at rs1/test-db-01a:27018,test-db-01b:27018,test-db-01c:27018,test-db-01d:27018
 data : 0B docs : 0 chunks : 1
 estimated data per chunk : 0B
 estimated docs per chunk : 0

Totals
 data : 0B docs : 0 chunks : 3
 Shard rs3 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard rs2 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard rs1 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B

mongos> db.wifiPe_20230718.getShardDistribution()

Shard rs2 at rs2/test-db-02a:27018,test-db-02b:27018,test-db-02c:27018,test-db-02d:27018
 data : 106.93GiB docs : 2009122 chunks : 1
 estimated data per chunk : 106.93GiB
 estimated docs per chunk : 2009122

Shard rs3 at rs3/test-db-03a:27018,test-db-03b:27018,test-db-03c:27018,test-db-03d:27018
 data : 107.01GiB docs : 2009791 chunks : 1
 estimated data per chunk : 107.01GiB
 estimated docs per chunk : 2009791

Shard rs1 at rs1/test-db-01a:27018,test-db-01b:27018,test-db-01c:27018,test-db-01d:27018
 data : 106.85GiB docs : 2007546 chunks : 1
 estimated data per chunk : 106.85GiB
 estimated docs per chunk : 2007546

Totals
 data : 320.8GiB docs : 6026459 chunks : 3
 Shard rs2 contains 33.33% data, 33.33% docs in cluster, avg obj size on shard : 55KiB
 Shard rs3 contains 33.35% data, 33.34% docs in cluster, avg obj size on shard : 55KiB
 Shard rs1 contains 33.3% data, 33.31% docs in cluster, avg obj size on shard : 55KiB

mongos>
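
For reference, a hedged sketch of how such a dated collection can be pre-split into one chunk per shard on a hashed key; the numInitialChunks option only applies to hashed shard keys, and the namespace mirrors the examples above:

db.adminCommand({
  shardCollection: "test.wifiPe_20230719",
  key: { lineId: "hashed" },
  numInitialChunks: 3   // one initial chunk per shard in a 3-shard cluster
})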
 

Comment by Chris Kelly [ 19/Jul/23 ]

Thanks for the prompt reply stephenpaul2727@gmail.com. I want to preface this by saying that asymmetric performance like this is usually best addressed first by checking our documentation or reaching out on the community forums.

It sounds like flushRouterConfig is a bit of a distraction here. Does it actually help with this behavior normally? Has its behavior suddenly changed? If so, pointing out specific timestamps when this occurs would also be helpful.

Before proceeding further, please review your sharding implementation to ensure it's being used effectively. Here are some recommendations I would like you to fully consider, and share the results of, before we engage further in a server bug investigation.

  • If a mongos is directing queries to one shard over another, it may be helpful to also have information such as shard distribution. See https://www.mongodb.com/docs/manual/reference/method/db.collection.getShardDistribution/ for an example. 
  • It's possible that your shard key choice or the distribution of the data within the key is causing the uneven load. A well-chosen shard key should evenly distribute writes and queries across shards. Check your shard key design and the chunk distribution with the command sh.status() (a minimal sketch of these checks follows this list).
  • It is especially significant to confirm whether you are using a monotonically increasing shard key, since a shard key on a value that increases or decreases monotonically is more likely to distribute inserts to a single chunk within the cluster.
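
A minimal sketch of these checks from a mongos, assuming MongoDB 4.4, where config.chunks documents still carry an ns field (the namespace is a placeholder):

// Shard key definition for the suspect namespace:
db.getSiblingDB("config").collections.findOne({ _id: "dbname.pe_20230716" })
// Chunk count per shard for that namespace:
db.getSiblingDB("config").chunks.aggregate([
  { $match: { ns: "dbname.pe_20230716" } },
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])
// Overall view of shards, databases and chunk distribution:
sh.status()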

The root of this problem is often related to data distribution. If data is not evenly distributed across the shards, MongoDB will naturally end up querying the shard with the most data more often. If this is the case, you might want to consider rebalancing your shards. This could involve changing your shard key, splitting your chunks, resharding, or manually balancing your data around your existing shard key more effectively.

If your issue persists, and you suspect this is a server bug and not an implementation issue, we can proceed once the results above are provided. 

Christopher

Comment by Stephen Paul Adithela [ 19/Jul/23 ]

Hi chris.kelly@mongodb.com ,

Thanks for taking the time with this issue. I have uploaded the diagnostics-data tgz file to the upload portal from all data nodes and cfg nodes.

For the mongod logs, we do log some sensitive queries, so I am unsure whether I can upload those, but I will ask internally.

Would you be able to do some analysis with the diagnostics info itself?

Thanks,

Comment by Chris Kelly [ 19/Jul/23 ]

Hi stephenpaul2727@gmail.com,

Thanks for your report. To begin investigating the issue, it would be helpful to have some diagnostic data.

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Thank you!

Christopher
