[SERVER-65949] Sharded time series leads to high query rates on the primary shard with CommandOnShardedViewNotSupportedOnMongod errors Created: 26/Apr/22  Updated: 26/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: 5.0.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marnu vd Merwe Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 1
Labels: qi-timeseries, read-only-views, sharding, time-series
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File All shards mongostat.png     PNG File Mongotop.png     PNG File Multi-shard Metrics overview 1.png     PNG File Multi-shard Metrics overview 2.png     PNG File Screenshot 2022-04-26 104816.png     PNG File Shard mongo-stat.png     Zip Archive data - attempt2.zip     Zip Archive data and loadtester2.zip     PNG File image-2022-06-09-13-59-20-641.png     PNG File image-2022-06-09-14-04-39-666.png     PNG File image-2022-06-14-10-20-55-809.png     PNG File shard1-t2-v2.png     PNG File shard1-t2.png     PNG File shard1.png     PNG File shard2-t2.png     PNG File shard2.png    
Issue Links:
Related
is related to SERVER-57437 Implement insert routing for sharded ... Closed
is related to SERVER-42282 Consider replacing CommandOnShardedVi... Backlog
Assigned Teams:
Query Integration
Operating System: ALL
Steps To Reproduce:

My setup:

  • Everything is running locally
  • Docker Desktop
  • Mongo 5.0.7
  • 2 shards, with 3 servers in a replica set per shard
  • 1 mongos
  • 1 config replica set with 3 servers
  • Load tester application that pushes a lot of time series data, while simultaneously doing reads, written using the .NET C# driver, v. 2.15.0

Reproduction steps:

  1. Setup a 2 shard cluster, even in docker
  2. Create a sharded time series collection and insert a few records (see the sketch after these steps).  I added a few million records.  The destination bucket of each record doesn't make a difference, meaning the meta field can vary, i.e. different sensor IDs.
  3. Perform a sustained, heavy read load on the collection.
  4. Observe that the primary shard for the time series view processes a large number of "query" ops, while the other shard processes none.  This can be seen in the attached mongostat output, where shard1 on the left processes 1k+ "query" ops, but shard2 on the right processes none of them.
  5. Observe the following log entry on mongos:
    1. "ctx":"conn1516","msg":"Unable to establish remote cursors","attr":{"error":{"code":169,"codeName":"CommandOnShardedViewNotSupportedOnMongod","errmsg":"Resolved views on sharded collections must be executed by mongos","resolvedView":{"ns":"EndpointDataProtoTests.system.buckets.EndpointData:Endpoints-NormalV8","pipeline":[{"$_internalUnpackBucket":{"timeField":"t","metaField":"m","bucketMaxSpanSeconds":3600,"exclude":[]}}],"collation":{"locale":"simple"}}},"nRemotes":0}}

The attached image shows mongostat output captured with NoSqlBooster 7.1.  The left side is a direct connection to the primary of shard1, and the right side a direct connection to the primary of shard2.


Description

Sharding a time series collection leads to higher write throughput, but reads affect the whole cluster, because the primary shard shows a spike in CPU usage.  When reviewing the mongos logs, several entries state that "Resolved views on sharded collections must be executed by mongos".  When I stop the read load, these messages are no longer logged.

From my research, this seems related to SERVER-43376 (operations on non-sharded views in sharded clusters incur an extra round trip).
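
For illustration, a read of roughly this shape (issued through mongos; the names match the reproduction sketch above rather than our production schema) is enough to produce the logged error:

db.timeseries.find({
  SensorId: UUID("3b241101-e2bb-4255-8caf-4136c566a962"),   // illustrative sensor GUID
  EntryTime: { $gte: ISODate("2022-04-26T00:00:00Z") }
}).sort({ EntryTime: 1 });
// As I understand it, the shard answers such a find with CommandOnShardedViewNotSupportedOnMongod
// plus the resolvedView shown in the log entry above, and mongos then retries against
// system.buckets with the $_internalUnpackBucket pipeline - the extra round trip from SERVER-43376.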

This is a problem for us, because adding a read load affects the whole cluster's performance.  Our workload runs roughly 25 reads for every 100 writes.

I found the problem while load testing my sharded time series prototype on Atlas.



Comments
Comment by Matt Panton [ 10/Jan/23 ]

To resolve this issue and increase performance for views/time series, the sharding catalog must track views, which is an objective of the linked ticket PM-1966 (Track Views in Sharding Catalog).

Comment by Marnu vd Merwe [ 13/Jun/22 ]

Hi Chris,

Many thanks!  I appreciate the feedback!  

I originally spotted the issue in Atlas while streaming data from a production workload into my PoC Atlas environment.  I played around with time series collections to determine how we should provision a cluster for our current workload (10 M60 shards (Mongo 4.4) with NVMe storage and batched, unordered inserts).  I stopped scaling my cluster when I noticed that one shard had very high CPU usage while the others sat at about 50%-75% of the main shard's CPU usage.  The attached screenshots show the CPU utilization of one of my setups (I think this was a 4-shard M50 setup on Mongo 5.0.7), with a workload quite close to the attached test app.  During this test, I had an index on SensorId_1_EntryTime_1 (the meta field only had the SensorId inside).

I hope this gives the team a bit more to work with.

Kind regards,

Marnu

Comment by Chris Kelly [ 13/Jun/22 ]

Hi Marnu,

Thanks for your patience, and for providing extra context around your report. It seems this issue may be reproducing the behavior described in SERVER-43376, because time series collections essentially behave as views themselves. Specifically, MongoDB treats time series collections as writable non-materialized views on internal collections. To improve query performance, you could also try manually adding secondary indexes on the fields specified as the metaField and the timeField.
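
For example (a minimal sketch, assuming the metaField is "SensorId" and the timeField is "EntryTime", as in the schema described further down this ticket):

// Secondary index on the metaField and timeField; MongoDB maps it onto the
// underlying system.buckets collection for you.
db.timeseries.createIndex({ SensorId: 1, EntryTime: 1 });
db.timeseries.getIndexes();   // confirm it is listed on the time series namespace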

It is possible in this case that these views trigger the known extra-roundtrip issue in SERVER-42282, which is currently in the backlog.

As for there being more queries (in general) on the primary shard of the collection, we may need to look a little more. I see the same number of documents scanned on each shard, even though the total number of queries sent to the primary shard is much higher. CPU usage seems pretty balanced across your two shards and the mongos, so I'm not 100% sure what specific behavior you're experiencing when it comes to load - there is a minor increase on shard 2, but it's hard to say whether this is significant since overall CPU usage is low.

I'll forward this to the sharding team and see if they have any input on this. 

Regards,
Christopher

Comment by Marnu vd Merwe [ 03/Jun/22 ]

Hi Chris,

Thanks for the help thus far. Before we close this ticket: what were your findings regarding the reads that stack up on the primary shard?

Also, I want to ask that you please consider analyzing my latest diagnostic and log report. I hear what you are saying about shard keys, but I'm confident in my shard key selection: a non-changing GUID that does not monotonically increase and is very selective, ensuring that one GUID only ever targets one shard.  My first report, which you analyzed, had a problem in its queries, but I fixed this in the second one.
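
One way to verify that a single GUID targets a single shard (a sketch; the collection and field names are illustrative):

// Run through mongos; an equality match on the shard key should involve a single shard.
db.timeseries.find({ SensorId: UUID("3b241101-e2bb-4255-8caf-4136c566a962") })
  .explain("executionStats");
// The winning plan in the explain output should reference only one shard
// (aside from the view-resolution retry discussed in this ticket).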

Kind regards,

Marnu

Comment by Chris Kelly [ 01/Jun/22 ]

Hey Marnu,

Thanks for the extra info. Given what we know currently, I'd like to recommend asking our community for help first by posting on the MongoDB Developer Community Forums. That would be a more appropriate area to check out your query and shard key specifically, which should get more eyes on it.

If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it further as a possible bug here in the SERVER project. I'll close this ticket for now, but if you get more information on this after checking out the forums, feel free to open another and reference this one so we can get back to it.

Regards,
Christopher

Comment by Marnu vd Merwe [ 01/Jun/22 ]

Hi Chris,

Can you help review our code, please?  I see you have a keen eye for finding the errors in my queries.

After your mention of monotonically increasing shard keys, and your observation that my queries are not equally distributed, I reviewed my code and found that my queries targeted a specific shard with every batch.  For attempt 2, I changed the following:

  • I created the collection and pre-split it into 4 chunks, using LUUIDs this time instead of CSUUIDs (roughly sketched after this list)
  • I added more sensors, with less data per sensor, to ensure we have a more equally distributed load across chunks by increasing the number of unique shard key values
  • I used batch inserts to avoid the connection timeouts I was getting from the driver waiting a long time for 1.4 million inserts
  • I double-checked getShardDistribution on system.buckets.timeseries to ensure that data is distributed equally
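
Roughly sketched, the pre-split was along these lines (the split values and the shard key path on the buckets collection are illustrative; the test app itself writes legacy UUIDs through the C# driver):

// Split points are issued against the underlying buckets namespace, which carries the shard key.
sh.splitAt("telemetry.system.buckets.timeseries", { meta: UUID("40000000-0000-0000-0000-000000000000") });
sh.splitAt("telemetry.system.buckets.timeseries", { meta: UUID("80000000-0000-0000-0000-000000000000") });
sh.splitAt("telemetry.system.buckets.timeseries", { meta: UUID("c0000000-0000-0000-0000-000000000000") });

// Confirm the resulting distribution (the output is included in the attachment).
db.getCollection("system.buckets.timeseries").getShardDistribution();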

My latest attachment, data - attempt2, includes:

  • diagnostic data and logs from mongos, and the primary of each shard
  • output of sh.status()
  • output of db.getCollection("system.buckets.timeseries").getShardDistribution()
  • updated load test application

With the latest application, it is easier to see the distributed load being generated and then to observe the high combined read load.  Something I saw again this time is that the primary shard for the collection shows a high number of "query" ops, while both shards show "command" ops that, combined, equal the number of "query" ops.  My assumption is that every find query hits the primary shard as a "query" op and is then redirected to the correct shard, where that shard performs a "command" op to return the data.

Regarding a different shard key: my key is a random GUID, and I have no other field to shard on that would not lead to monotonically increasing keys.  My document contains (see the sketch after this list):

  • SensorId: Random GUID
  • EntryTime: DateTime
  • Value: Sensor's value
  • Id: ObjectId generated by the MongoDB server
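
A single document therefore looks roughly like this (illustrative values):

db.timeseries.insertOne({
  SensorId: UUID("3b241101-e2bb-4255-8caf-4136c566a962"),   // random, non-monotonic GUID
  EntryTime: new Date(),                                     // time field
  Value: 42.5                                                // sensor reading
});
// Id is the ObjectId _id generated by the MongoDB server.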

Do you have a recommendation on a new shard key?  This document mimics our production setup where we link sensors to GUIDs.

Let me know if you need anything else.

Marnu

Comment by Chris Kelly [ 01/Jun/22 ]

Hi Marnu,

I'm going to look into this further, but I just wanted to share something that might be a good resource for optimizing GUID-based sharding by pre-splitting data.

Shard 1: (screenshot attached)

Shard 2: (screenshot attached)

Shard 2 shows most of the query load for me, even though it looks like data is being inserted into both shards. I also saw the following entries in the logs preceding the "Resolved views on sharded collections must be executed by mongos" error:

{"t":{"$date":"2022-05-31T16:31:50.114+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619902, "ctx":"CatalogCache-3","msg":"Collection has found to be unsharded after refresh","attr":{"namespace":"telemetry.timeseries","durationMillis":1}}
{"t":{"$date":"2022-05-31T16:37:43.862+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619902, "ctx":"CatalogCache-7","msg":"Collection has found to be unsharded after refresh","attr":{"namespace":"telemetry.system.views","durationMillis":4}}
{"t":{"$date":"2022-05-31T16:37:43.866+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619902, "ctx":"CatalogCache-7","msg":"Collection has found to be unsharded after refresh","attr":{"namespace":"telemetry.system.buckets.system.views","durationMillis":3}}

 

I'm not sure what to make of this just yet, but I wanted to keep you in the loop. Could you share the output of running "sh.status()" on your mongos after your testing suite is complete?
Also, if you'd like to try playing around with your shard key in your test driver and see if you get different results, that would be interesting info to have on hand too.

Regards,
Christopher

Comment by Chris Kelly [ 31/May/22 ]

Just to add some more information: I think this is potentially normal behavior if you're using a monotonically increasing shard key, since a shard key on a value that increases or decreases monotonically is more likely to direct inserts to a single chunk within the cluster.

If you are reading that data quickly after it's inserted, before the chunk is split and migrated to the other shard, it would make sense that you're reading from this one large chunk on only one shard before it has time to be split.

In your case, I think it would be worth consulting some information in the docs, specifically with:

I'll also take a look at your submitted logs.

Christopher

Comment by Marnu vd Merwe [ 31/May/22 ]

Hi Chris,

Thanks for the help!

Please see the attached logs and diagnostic data, as requested; sh1 and sh2 are the primaries of each shard.  The test was run on MongoDB 5.0.7.

Also included is the .NET 6.0 C# project I used to generate read/write load.

I hope this helps!

Marnu

Comment by Chris Kelly [ 26/May/22 ]

Hello Marnu,

Thanks for your patience on this. I just wanted to let you know that we're taking a look at this.

Currently I can see the following when I try to reproduce this:

  • Queries do in fact stack up on the primary shard for the time-series collection under high read/write load
  • This read load is not observed on the other shard's nodes
  • I do NOT see the messages in the mongos log.

It'd be helpful if you could provide a little more data to look at. Specifically, could you archive (tar or zip) and upload the following to the ticket (a sketch for locating the relevant paths follows the list):

  • the mongos logs + one mongod log from each shard
  • the $dbpath/diagnostic.data directory (the contents are described here) for mongos and a mongod on each shard
  • the queries and inserts that you're running to generate load
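
If it helps to locate those paths, here is a quick check from the shell while connected directly to a shard mongod (assuming dbPath and the log path were set via the config file or command line):

// Returns the parsed startup options, including data and log locations.
const opts = db.adminCommand({ getCmdLineOpts: 1 }).parsed;
print(opts.storage.dbPath);      // diagnostic.data lives under this directory on a mongod
print(opts.systemLog.path);      // the node's log file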

Regards,
Christopher
