Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 6.0.6
Component/s: None
Labels:
- Bug

Assigned Teams:

Query Execution
Operating System:
ALL
Sprint:
QE 2023-09-04, QE 2023-09-18, QE 2023-10-02, QE 2023-10-16, QE 2023-10-30, QE 2023-11-13
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Very similar to what was observed here:

https://jira.mongodb.org/browse/SERVER-42232

Seeing the same issue in MongoDB 6.0.

More info:

Using MongoDB 6.0.6.
Having a DB with sharded collections.
The DB has a few shards, every shard is in a P-S-A (2 node replicas and an arbiter).
My application is using change-streams to gather statistics, and is constantly reading for operations, while also maintaining the clusterTime timestamp in case the the app crashes, so we can continue from the last handled point.

For the shards, we are self-maintaining the storage, and decide on adding a shard from time to time.
Been noticing that when a shard is added the following happens:

added shard at ~9:15 AM, shard comes up, rebalancing its data with other shards

few hours later at ~12:30 PM, Resume of change stream was not possible errors start showing up, coming from the port of the new added shard.

{"t":{"$date":"2023-08-02T06:32:40.283+00:00"},"s":"W",  "c":"QUERY",    "id":20478,   "ctx":"conn188","msg":"getMore command executor error","attr":{"error":{"code":286,"codeName":"ChangeStreamHistoryLost","errmsg":"Resume of change stream was not possible, as the resume point may no longer be in the oplog."},"stats":{},"cmd":{"getMore":2383202557420931747,"collection":"$cmd.aggregate","maxTimeMS":1000}}}

also, the first available event in the new shard is delayed very much, here we added the shard around ~9:30 AM, but the first available event was only at ~14:30 PM, many hours later, but for other shards the first even is before that, so not sure how this makes sense because this is a totally new shard.

In this case, the client, which is connected to the mongos router, is unable to proceed unless we move the start_at_operation_time pointer to 14:30 PM so it can continue reading (and this also makes us lose all the update from ~12:30 PM to ~14:30 PM which is not acceptable).

Why is this happening? isn’t the change-stream suppose to continue regularly and add the new shard updates once its ready? failing like this does not look like normal behavior.
Is there a safe way to add a shard and keep reading incoming updates for the other shards through the mongos router without getting stopped by this un-synced shard?
Is this happening because of the P-S-A configuration, and will not occur in P-S-S? if yes, why is that?

Following this community thread, were the question was raised few days ago but with no answer:

https://www.mongodb.com/community/forums/t/change-stream-resume-point-lost-after-adding-a-new-shard/232005

related to

SERVER-42232 Adding a new shard renders all preceding resume tokens invalid

Closed

SERVER-42290 Target change streams to the primary shard when possible

Backlog

Assignee:: Romans Kasperovics
Reporter:: Oded Raiches
Participants:: Ana Meza, Oded Raiches, Romans Kasperovics, Yuan Fang
Votes:: 0 Vote for this issue
Watchers:: 7 Start watching this issue

Created:: Jun 22 2023 01:09:51 PM UTC
Updated:: Nov 06 2023 01:15:51 PM UTC
Resolved:: Nov 06 2023 01:15:50 PM UTC

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates