[SERVER-33090] $changeStream can fail if a shard was removed Created: 02/Feb/18 Updated: 30/Jan/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.6.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | change-streams-improvements, query-44-grooming | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Query 2018-02-12, Query 2018-02-26 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||
| Description |
|
because mongos currently does not allow remote queries on shards that were removed. It is also unclear from the user point of view when is it safe to physically take out the shard that was removed and not make the change stream lose track of some of the writes. Possible fix is to make it such that removeShard and changeStream work together:
In this scheme, the user will know that it is safe to completely decommission the shard once removeShard completes. |
| Comments |
| Comment by Randolph Tan [ 02/Mar/18 ] |
|
charlie.swanson The reason why it happens occasionally is because of the timing of when mongos realize that the shard was removed. The refresh currently happens every 30 seconds and if the refresh just happened right after the shard has been removed and right before the remote command is about to be sent, you will get this error. I have attached a diff that can consistently reproduce this failure. |
| Comment by Kaloian Manassiev [ 02/Mar/18 ] |
|
renctan, can you please elaborate here what the problem is and if appropriate put it back on the query team's backlog? |
| Comment by Ian Whalen (Inactive) [ 23/Feb/18 ] |
|
Closing as Incomplete and moving the related Build Failure back to Open. |
| Comment by Charlie Swanson [ 16/Feb/18 ] |
|
Ping renctan. |
| Comment by Charlie Swanson [ 12/Feb/18 ] |
|
renctan I don't think we know enough here, and I'm confusing myself trying to figure out how all this networking stuff happens. The assertion you point to indicates that the AsyncResultsMerger could not perform a getMore on an existing cursor. I don't understand why this is the case. The error message there looks like it comes from here. From what I can understand, it looks like we are prevented from opening a connection to a shard that has been removed from the cluster. This broadly would make sense, but in this case we have already opened a cursor on that shard and we would like to finish iterating it before the host is stopped. Do you have any insight into why this won't happen on every run, but only occasionally? One guess is that usually the connection stays open, but things may have been slow enough here that it got dropped before we sent the next getMore? |