[SERVER-32088] ChangeStream resumeAfter does not work on sharded collections if not all shards have chunks for the collection Created: 27/Nov/17  Updated: 30/Oct/23  Resolved: 29/May/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.0.1, 4.1.1

Type: Bug
Priority: Major - P3
Reporter: Spencer Brody (Inactive)
Assignee: Nicholas Zolnierz
Resolution: Fixed
Votes: 0
Labels: todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-32190 Make a MongoProcessInterface availabl... Closed
depends on SERVER-34695 Move aggregation retry logic to comma... Closed
is depended on by SERVER-34710 Add test for resuming a change stream... Closed
Problem/Incident
causes PYTHON-1582 Test Failure - TestChangeStream.test_... Closed
Related
related to SERVER-32029 ChangeStream resumeAfter does not wor... Closed
related to SERVER-43475 Complete TODO listed in SERVER-32088 Closed
related to SERVER-44210 Complete TODO listed in SERVER-32088 Closed
is related to SERVER-35254 Resuming a change stream on a stale m... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Query 2018-04-23, Query 2018-05-07, Query 2018-05-21, Query 2018-06-04
Participants:
Linked BF Score: 26

 Description   

Taken from a comment on SERVER-32029:

If a collection is sharded but not present on all shards, then the shards that own no chunks will not know about the collection, and will mistakenly return an error when a change stream is resumed. This bug is harder to fix, because a shard cannot easily tell whether the collection doesn't exist because it was dropped, or whether it doesn't exist locally because that shard doesn't own any chunks for it.
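
For context, a minimal sketch of the client-side resume pattern that exercises this code path. It assumes a pymongo-style collection whose `watch()` accepts `resume_after` and yields change documents whose `_id` field is the resume token. The names `consume_changes`, `handle`, and `max_events` are hypothetical and do not come from this ticket.

```python
from typing import Any, Callable, Optional


def consume_changes(collection: Any,
                    handle: Callable[[dict], None],
                    resume_token: Optional[dict] = None,
                    max_events: Optional[int] = None) -> Optional[dict]:
    """Read change events and return the last resume token seen.

    Assumption: `collection` behaves like pymongo's Collection, i.e.
    watch(resume_after=...) returns a context manager that iterates
    over change documents, each carrying its resume token in `_id`.
    """
    seen = 0
    # resume_after=None opens a fresh stream; a token resumes after that event.
    with collection.watch(resume_after=resume_token) as stream:
        for change in stream:
            handle(change)
            # Remember the token so a later call can resume from here.
            resume_token = change["_id"]
            seen += 1
            if max_events is not None and seen >= max_events:
                break
    return resume_token
```

A caller would store the returned token and pass it back as `resume_token` after a disconnect. Per the description above, it is that second, resuming call that fails on an affected server when some shards own no chunks for the collection.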



 Comments   
Comment by Asya Kamsky [ 01/Oct/18 ]

sjohnson540 we discussed that the backport is high-risk. I'm currently researching the value of the backport to 3.6 to assess it against the risk. Can you give us an idea of the impact on you of having to wait until 4.0 for the fix?

Feel free to e-mail me at asya at mongodb.com if you want to discuss things you'd rather not mention publicly.

Comment by Sam Johnson [ 01/Oct/18 ]

asya just curious – when is the next triage meeting planned for?

Comment by Sam Johnson [ 25/Sep/18 ]

Yes, they are all part of the same sharded cluster. Most replica sets have one or two collections on them.

Comment by Asya Kamsky [ 25/Sep/18 ]

sjohnson540 we are going to discuss the backport to 3.6 at our next triage meeting.

We originally decided not to backport it because of the complexity of the change, but will reconsider it now.

Just to confirm, you observe this happening on a sharded collection (you mention 80 replica sets but not whether they are part of a sharded cluster)?

Comment by Sam Johnson [ 25/Sep/18 ]

@NicholasZolnierz I was hoping to get an update on this. Is there any possibility of it being backported to 3.6? It breaks the functionality of Change Streams for our cluster.

We are currently running ~80 replica sets in production. As such, we cannot afford to jump major versions for a bug fix, but if it were backported we could have it rolled out very soon, and would be happy to share our experience running change streams on large deployments.

Comment by Sam Johnson [ 26/Jul/18 ]

Hello! I have run into this issue with our change streams in our staging environment in preparation to deploy to production. Is this fix going to be backported to 3.6.(6?) or will it only be released in 4.0?

Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'nzolnierzmdb', 'name': 'Nick Zolnierz', 'email': 'nicholas.zolnierz@mongodb.com'}

Message: SERVER-32088: ChangeStream resumeAfter does not work on sharded collections if not all shards have chunks for the collection

(cherry picked from commit a76082905d63ac8aaaae25e5c76812e6edf9bc07)
Branch: v4.0
https://github.com/mongodb/mongo/commit/8bde61b46f1207ade073d51adbfe9ea9004925cd

Comment by Nicholas Zolnierz [ 20/Jun/18 ]

Thanks greg.mckeon, that's correct; this was originally scoped for 4.0.

Comment by Gregory McKeon (Inactive) [ 20/Jun/18 ]

nicholas.zolnierz I'm under the impression this is intended for 4.0, so I've requested the backport.

CC shane.harvey so he's aware that this may break drivers testing against 4.0 once we backport.

Comment by Ian Whalen (Inactive) [ 20/Jun/18 ]

nicholas.zolnierz does this need to get into 4.0?

CC greg.mckeon.

Comment by Githook User [ 29/May/18 ]

Author:

{'username': 'nzolnierzmdb', 'name': 'Nick Zolnierz', 'email': 'nicholas.zolnierz@mongodb.com'}

Message: SERVER-32088: ChangeStream resumeAfter does not work on sharded collections if not all shards have chunks for the collection
Branch: master
https://github.com/mongodb/mongo/commit/a76082905d63ac8aaaae25e5c76812e6edf9bc07

Comment by Charlie Swanson [ 12/Mar/18 ]

Bumping this out of the sprint in favor of SERVER-32283
