[SERVER-44883] Sharded $merge may fail with NotMaster or CursorNotFound if run immediately after a failover on one shard Created: 29/Nov/19 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Bernard Gorman | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | qexec-team | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Query Execution
|
| Sprint: | Query 2019-12-30, Query 2020-01-13 |
| Participants: |
| Description |
|
When shard A's Primary steps down and a new Primary is elected, there is a window of time during which the ReplicaSetMonitors on shards other than A still believe that the original node is shard A's Primary. If a $merge is issued during this window and dispatched to shard B, it will fail with a NotMaster exception when it attempts to write to shard A, because the writes are targeted at a node which is now a Secondary. It also appears possible for a CursorNotFound exception to result from a NotMasterNoSlaveOk exception on shard A if the $merge is dispatched to a shard as part of the latter half of a split pipeline. |
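For illustration only, a caller hitting this window could retry the aggregation until the stale view of shard A's Primary has been refreshed. The sketch below is a minimal, driver-free model of that idea, not the workaround used in the test mentioned in the comments and not a proposed server fix. `CommandError`, `run_with_retry`, and `make_flaky_merge` are hypothetical names introduced here; the numeric values are assumed to be the server error codes for NotMaster, NotMasterNoSlaveOk, and CursorNotFound. Retrying is only reasonable when the $merge is idempotent for its chosen whenMatched/whenNotMatched modes.

```python
import time

# Assumed server error codes for the errors named in this ticket.
NOT_MASTER = 10107
NOT_MASTER_NO_SLAVE_OK = 13435
CURSOR_NOT_FOUND = 43
TRANSIENT_CODES = {NOT_MASTER, NOT_MASTER_NO_SLAVE_OK, CURSOR_NOT_FOUND}


class CommandError(Exception):
    """Hypothetical stand-in for a driver error that carries a server code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def run_with_retry(operation, attempts=5, delay_secs=0.0):
    """Call operation(), retrying only on the failover-related codes above."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except CommandError as err:
            # Re-raise anything unrelated to failover, or the final failure.
            if err.code not in TRANSIENT_CODES or attempt == attempts:
                raise
            time.sleep(delay_secs)


def make_flaky_merge(failure_codes):
    """Hypothetical $merge stand-in: fails once per listed code, then succeeds."""
    remaining = list(failure_codes)

    def run_merge():
        if remaining:
            raise CommandError(remaining.pop(0), "stale view of shard A's Primary")
        return "merged"

    return run_merge
```

With a real driver, `operation` would wrap the aggregate call containing the $merge stage, and a non-zero `delay_secs` would give the ReplicaSetMonitors on the other shards time to discover the new Primary.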
| Comments |
| Comment by Bernard Gorman [ 02/Dec/19 ] |
|
david.storch: done. Note that I didn't investigate this much beyond what is outlined in the description; it arose while I was writing a multiversion test, and once I had established what was happening I just worked around it, since it was not relevant to what I was attempting to test. |
| Comment by David Storch [ 02/Dec/19 ] |
|
bernard.gorman, can you provide a description? |