[SERVER-44883] Sharded $merge may fail with NotMaster or CursorNotFound if run immediately after a failover on one shard Created: 29/Nov/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bernard Gorman Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Execution
Sprint: Query 2019-12-30, Query 2020-01-13
Participants:

 Description   

When shard A's Primary steps down and a new Primary is elected, there is a window of time during which the ReplicaSetMonitors on shards other than A still believe that the original node is shard A's Primary. A $merge issued during this window and dispatched to shard B will fail with a NotMaster exception if it attempts to write to shard A, since the writes will be targeted at a node which is now a Secondary. It also appears possible for a CursorNotFound exception to result from a NotMasterNoSlaveOk exception on shard A if the $merge is dispatched to a shard as part of the latter half of a split pipeline.
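Until the server handles this case, a client or test could in principle retry the aggregation once the stale view of shard A has had time to refresh. The following is only a minimal sketch of that idea, not the workaround used by the reporter and not a proposed server fix. The helper name run_with_failover_retry, its parameters, and the run_merge callable are hypothetical; the numeric codes are the server error codes for the three exceptions named above.

```python
import time

# Server error codes for the exceptions named in the description.
RETRYABLE_CODES = {
    10107,  # NotMaster
    13435,  # NotMasterNoSlaveOk
    43,     # CursorNotFound
}


def run_with_failover_retry(run_merge, max_attempts=5, delay_secs=0.5, sleep=time.sleep):
    """Call run_merge() until it succeeds or the attempts run out.

    run_merge is a hypothetical zero-argument callable that issues the
    aggregation ending in $merge. An exception is retried only if it carries
    a `code` attribute listed in RETRYABLE_CODES; anything else propagates.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_merge()
        except Exception as exc:
            code = getattr(exc, "code", None)
            if code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Assumption: waiting gives the other shards' ReplicaSetMonitors
            # time to discover shard A's new Primary before the next attempt.
            sleep(delay_secs)
```

With PyMongo, OperationFailure exposes a code attribute, so run_merge could wrap a collection.aggregate(...) call. Note that a failed $merge may already have written some documents, so a blind retry is only reasonable when the $merge's whenMatched/whenNotMatched modes make re-running it harmless.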



 Comments   
Comment by Bernard Gorman [ 02/Dec/19 ]

david.storch: done. Note that I didn't investigate this much beyond what is outlined in the description; it arose while I was writing a multiversion test, and once I had established what was happening I just worked around it, since it was not relevant to what I was attempting to test.

Comment by David Storch [ 02/Dec/19 ]

bernard.gorman can you provide a description?

Generated at Thu Feb 08 05:07:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.