[SERVER-33090] $changeStream can fail if a shard was removed Created: 02/Feb/18  Updated: 30/Jan/23

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: change-streams-improvements, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File change_stream.diff    
Issue Links:
Depends
Related
related to SERVER-33778 Remove change_stream_remove_shard.js Closed
is related to SERVER-44039 Improve $changeStream handling of clu... Backlog
Assigned Teams:
Query Execution
Operating System: ALL
Sprint: Query 2018-02-12, Query 2018-02-26
Participants:
Linked BF Score: 0

 Description   

because mongos currently does not allow remote queries on shards that were removed. It is also unclear from the user point of view when is it safe to physically take out the shard that was removed and not make the change stream lose track of some of the writes.

Possible fix is to make it such that removeShard and changeStream work together:

  • removeShard will insert an 'end of life' sentinel oplog entry
  • changeStream sees the sentinel and kills itself
  • removeShard will not return success until there are no more change stream cursors in the shard

In this scheme, the user will know that it is safe to completely decommission the shard once removeShard completes.



 Comments   
Comment by Randolph Tan [ 02/Mar/18 ]

charlie.swanson The reason why it happens occasionally is because of the timing of when mongos realize that the shard was removed. The refresh currently happens every 30 seconds and if the refresh just happened right after the shard has been removed and right before the remote command is about to be sent, you will get this error. I have attached a diff that can consistently reproduce this failure.

Comment by Kaloian Manassiev [ 02/Mar/18 ]

renctan, can you please elaborate here what the problem is and if appropriate put it back on the query team's backlog?

Comment by Ian Whalen (Inactive) [ 23/Feb/18 ]

Closing as Incomplete and moving the related Build Failure back to Open.

Comment by Charlie Swanson [ 16/Feb/18 ]

Ping renctan.

Comment by Charlie Swanson [ 12/Feb/18 ]

renctan I don't think we know enough here, and I'm confusing myself trying to figure out how all this networking stuff happens.

The assertion you point to indicates that the AsyncResultsMerger could not perform a getMore on an existing cursor. I don't understand why this is the case. The error message there looks like it comes from here. From what I can understand, it looks like we are prevented from opening a connection to a shard that has been removed from the cluster. This broadly would make sense, but in this case we have already opened a cursor on that shard and we would like to finish iterating it before the host is stopped.

Do you have any insight into why this won't happen on every run, but only occasionally? One guess is that usually the connection stays open, but things may have been slow enough here that it got dropped before we sent the next getMore?

Generated at Thu Feb 08 04:32:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.