[SERVER-44356] Change Stream latency is determined by config server write rate Created: 01/Nov/19  Updated: 08/Jan/24  Resolved: 16/Feb/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 4.3.4

Type: Improvement Priority: Major - P3
Reporter: Bernard Gorman Assignee: Bernard Gorman
Resolution: Fixed Votes: 0
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-42723 New shard with new database can be ig... Closed
is related to SERVER-80427 Avoid change streams latency caused b... Backlog
Backwards Compatibility: Fully Compatible
Sprint: Query 2019-12-02, Query 2019-12-16, Query 2019-12-30, Query 2020-01-13, Query 2020-01-27, Query 2020-02-10, Query 2020-02-24
Participants:
Linked BF Score: 0

 Description   

In SERVER-42723, we introduced an improved mechanism for change streams to detect the addition of new shards to the cluster. This involves opening an internal cursor on the config servers alongside the shard streams, in order to monitor for writes to the config.shards collection. While this approach guarantees that we will always correctly pick up new shards, it makes the change stream's latency dependent on the config server write rate: since the config server cursor is treated as just another shard in the sorted stream from the AsyncResultsMerger's (ARM's) perspective, the merger cannot return results from any shard until the config server writes an oplog entry whose timestamp surpasses the clusterTime of those events.
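For illustration only, here is a minimal sketch of that sort-merge constraint in Python; the names and structure are hypothetical (this is not the server's actual ARM implementation), but they show how a quiet config-server cursor holds back every shard's buffered events:

{code:python}
# Hypothetical sketch: the merged stream can only release events up to the
# minimum high-water mark across ALL participating cursors, including the
# config.shards monitoring cursor.

def releasable_events(cursors):
    """cursors: dict of name -> {'high_water_mark': ts, 'buffered': [(ts, event), ...]}"""
    # The merge frontier is the minimum high-water mark across all cursors.
    frontier = min(c['high_water_mark'] for c in cursors.values())
    out = []
    for c in cursors.values():
        out.extend(ev for ev in c['buffered'] if ev[0] <= frontier)
    return sorted(out)  # sorted by clusterTime

cursors = {
    'shard0':    {'high_water_mark': 120, 'buffered': [(105, 'insert'), (118, 'update')]},
    'shard1':    {'high_water_mark': 125, 'buffered': [(110, 'delete')]},
    # The config server has written nothing since ts=100, so nothing newer than
    # ts=100 can be returned to the client yet.
    'configsvr': {'high_water_mark': 100, 'buffered': []},
}
print(releasable_events(cursors))  # [] -- all shard events are stuck behind the config cursor
{code}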

In addition to the writes that the config servers perform to persist metadata changes, each mongoS in the cluster pings the config servers every 10 seconds, and every component in the cluster lockpings every 30 seconds. A large and/or busy cluster will therefore not suffer much additional latency. But in a cluster that has only a single mongoS and is not actively splitting or migrating, our latency guarantees degrade from a worst case of ~10s (when one of the shards in the cluster is completely inactive) to a minimum of ~10s regardless of how active the shards are.

There are a few ways we could address this:

  • Ask Sharding if they are prepared to enable writePeriodicNoops on the config servers by default, and set it to a short interval of maybe 1-2 seconds to minimize the impact on change streams. Each no-op entry is 103 bytes, which even at a 1-second interval is only ~362kB per hour before compression (see the sanity check after this list).
  • Use both the old and new shard-detection mechanisms. The new approach is applicable to all cases, but only strictly necessary for (1) whole-cluster streams, and (2) streams which are opened on a database that does not yet exist. Aside from those cases, we do not need to open a cursor on the config servers at all. In the case of (2), we could also close the config.shards cursor as soon as we see the first event in the stream.
  • Do nothing and consider this another flavour of activity-dependent latency.
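As a quick sanity check of the oplog overhead figure quoted in the first option (one 103-byte no-op entry per second):

{code:python}
# Back-of-the-envelope check: one 103-byte no-op oplog entry per second,
# before compression.
NOOP_ENTRY_BYTES = 103
ENTRIES_PER_HOUR = 60 * 60  # one entry per second

bytes_per_hour = NOOP_ENTRY_BYTES * ENTRIES_PER_HOUR
print(f"{bytes_per_hour} bytes/hour ~= {bytes_per_hour / 1024:.0f} kB/hour")
# -> 370800 bytes/hour ~= 362 kB/hour
{code}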


 Comments   
Comment by Githook User [ 16/Feb/20 ]

Author:

{'username': 'gormanb', 'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com'}

Message: SERVER-44356 Set config server periodicNoopIntervalSecs to 1
Branch: master
https://github.com/mongodb/mongo/commit/1d21e89fa5c9d838b2f40045c26fe41f306a8873

Comment by Kaloian Manassiev [ 28/Jan/20 ]

We just had a meeting about this with jack.mulrow, esha.maharishi and renctan:

Jack pointed out to me that the gossiped configOpTime is actually the majority-committed opTime on the config server, not the last-written opTime like the clusterTime. As a consequence, reads from the config server (which are w:majority, afterOpTime: configOpTime) are less likely to block waiting for that opTime to become majority committed. This in turn means that enabling the NoopWriter on the config server is not as problematic as I originally thought.
 
So in summary, we can just enable the NoopWriter on the config server in order to address this problem and there is no need to rely on the topologyOpTime.
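As an operational sketch of what this involves, the snippet below reads the relevant parameters from a config server with PyMongo. The parameter names writePeriodicNoops and periodicNoopIntervalSecs come from this ticket and the linked commit; the connection string is hypothetical, and since it is not confirmed here whether these parameters are runtime-settable or startup-only, the example only inspects them:

{code:python}
# Inspect the no-op writer settings on a config server with PyMongo.
from pymongo import MongoClient

# Hypothetical config server replica set connection string.
client = MongoClient("mongodb://cfg1:27019,cfg2:27019,cfg3:27019/?replicaSet=configRS")

params = client.admin.command(
    "getParameter", 1,
    writePeriodicNoops=1,
    periodicNoopIntervalSecs=1,
)
print("writePeriodicNoops:", params.get("writePeriodicNoops"))
print("periodicNoopIntervalSecs:", params.get("periodicNoopIntervalSecs"))
{code}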

Comment by Bernard Gorman [ 10/Jan/20 ]

kaloian.manassiev: I'm not sure that will address the original problem, as outlined in SERVER-42723. We already have an oplog event migrateChunkToNewShard which is generated whenever a chunk is moved to a shard that doesn't have any chunks; this was the original way that change streams detected the addition of new shards. The problem identified in SERVER-42723 is that, if a new shard is added and a new database is created on that shard, then a whole-cluster change stream will miss all events that occur on the shard until the first chunk migrates to it. The problem is actually somewhat more extensive than this simple example; see this comment for more details.

How will the new clusterTopologyOpTime help us to avoid this scenario?
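For reference, a whole-cluster change stream of the kind described above can be opened through a mongoS with PyMongo's MongoClient.watch(); the connection string below is hypothetical:

{code:python}
# Whole-cluster change stream opened through mongoS. Before SERVER-42723, events
# on a newly added shard holding a brand-new database could be missed until the
# first chunk migration (migrateChunkToNewShard) targeted that shard.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.net:27017")

with client.watch() as stream:
    for change in stream:
        print(change["operationType"], change.get("ns"))
{code}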

Comment by Kaloian Manassiev [ 08/Jan/20 ]

Apologies for the delay in getting back to this. I didn't realise it had an impact on driver testing as well; I thought it was about performance only.

Enabling the NoopWriter on the config server will introduce unnecessary waits for metadata reads, which in turn can have a negative and possibly unpredictable impact on the latency of CRUD operations that end up having to do a routing info refresh. For example, say a migration commits (with w:majority) and produces opTime 100, but before the commit returns to the caller shard, the NoopWriter advances the latest opTime to 101. Now the refresh needs to wait for 101 to become majority committed instead of 100.
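(A toy restatement of this concern, with hypothetical opTime values rather than server code:)

{code:python}
# Toy illustration: a no-op write that lands between the migration commit and
# the subsequent routing-info refresh raises the opTime the refresh must wait on.
majority_committed = 99            # current majority commit point on the config server

migration_commit_optime = 100      # the migration's metadata write
noop_optime = 101                  # NoopWriter fires before the refresh observes the opTime

observed_config_optime = max(migration_commit_optime, noop_optime)

# The afterOpTime read for the refresh now waits for 101 (not 100) to become
# majority committed.
print("refresh waits for opTime", observed_config_optime,
      "; majority commit point is currently", majority_committed)
{code}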

Because of this, I am hesitant to just make this change.

Instead, what we are proposing is to introduce a clusterTopologyOpTime vector clock component, which gets advanced only when shards are added or removed. This, combined with the causal consistency provided for reads against the metadata, will ensure that if any chunk is moved to a newly added shard, any change stream will see the newly added shard when it sees the chunk migration.

This is not something that is quick to do, though, and will take at least a couple of iterations.
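A hypothetical sketch of the proposal, purely to illustrate the idea (this is not the server implementation, and the names are made up):

{code:python}
# A topology-only vector clock component that advances solely on addShard /
# removeShard, so waiting on it is unaffected by unrelated config metadata writes.
from dataclasses import dataclass

@dataclass
class ToyVectorClock:
    cluster_time: int = 0               # advanced by every write in the cluster
    config_op_time: int = 0             # advanced by config metadata writes
    cluster_topology_op_time: int = 0   # advanced only when shards are added/removed

    def on_config_write(self, op_time: int, is_topology_change: bool = False) -> None:
        self.cluster_time = max(self.cluster_time, op_time)
        self.config_op_time = max(self.config_op_time, op_time)
        if is_topology_change:
            self.cluster_topology_op_time = max(self.cluster_topology_op_time, op_time)

clock = ToyVectorClock()
clock.on_config_write(100)                            # e.g. a chunk migration commit
clock.on_config_write(101, is_topology_change=True)   # addShard
print(clock.cluster_topology_op_time)                 # 101: topology waits only track this
{code}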
