[SERVER-67530] Loss of shard RS primary can lead to a loss of read availability for a collection after failed migration Created: 24/Jun/22  Updated: 06/Dec/22  Resolved: 14/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.3, 6.1.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Fix Votes: 0
Labels: sharding-product-sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File test.js     File test_v6.1.js    
Issue Links:
Related
is related to DOCS-15461 [Server] Remove suggestion to use arb... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Sprint: Sharding 2022-06-27
Participants:

 Description   

If a shard's replica set loses its primary node and is unable to elect a new one at the inopportune moment when a failed migration needs to be recovered, the collection will not be available on that shard.

This is because the recipient clears the filtering metadata when an error occurs. Completing the migration will also not succeed, since the primary of the recipient replica set cannot be reached, which means that the migration coordinator document will not be deleted. As a result, when new requests come in for the collection, shard recovery gets triggered because the filtering metadata was cleared earlier. The recovery process discovers that the migration coordinator document is still around and tries to complete the migration again, but times out waiting for the ReplicaSetMonitor (RSM) to find a primary.
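
As an illustration only, the sequence can be outlined in jstest style (this is a sketch, not the attached test.js; the mechanism that makes the migration fail at commit time is omitted, and names such as test.coll are placeholders):

    // Two shards with 2-node replica sets, so losing the recipient's primary
    // leaves no majority to elect a new one.
    const st = new ShardingTest({shards: 2, rs: {nodes: 2}});
    const ns = "test.coll";
    assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
    assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {_id: 1}}));

    // 1. A chunk migration to shard1 fails partway through (forced in the
    //    attached test; the exact mechanism is omitted here), leaving the
    //    migration coordinator document behind while the recipient clears
    //    its filtering metadata.
    st.s.adminCommand({moveChunk: ns, find: {_id: 0}, to: st.shard1.shardName});

    // 2. The recipient replica set loses its primary and cannot elect a new one.
    st.rs1.stop(st.rs1.getPrimary());

    // 3. Any routed query against the collection triggers filtering-metadata
    //    recovery on that shard; recovery finds the leftover coordinator
    //    document, retries migration completion, and times out waiting for
    //    the recipient's primary, so the read never succeeds.
    st.s.getDB("test").coll.find({_id: 0}).itcount();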



 Comments   
Comment by Garaudy Etienne [ 14/Jul/22 ]

Customers can use read concern level "available" to do reads. They won't be able to do w:1 writes without majority write availability, because the critical section cannot be released.
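
For illustration, a read with read concern "available" looks like this in the shell (a sketch; the collection name and filter are placeholders):

    // Reads at {level: "available"} skip the shard-versioning/filtering checks,
    // so they still return data while the shard cannot finish recovery
    // (at the cost of possibly returning orphaned documents).
    db.runCommand({
        find: "coll",
        filter: {_id: 0},
        readConcern: {level: "available"}
    });

    // Shell cursor form:
    db.coll.find({_id: 0}).readConcern("available").toArray();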

As such, closing as "won't fix"

Comment by Randolph Tan [ 14/Jul/22 ]

I don't think this is "works as designed" in the strict sense; I'll leave it to garaudy.etienne@mongodb.com to evaluate whether we are willing to accept this edge case and close it as "won't fix".

Comment by Cris Insignares Cuello [ 14/Jul/22 ]

randolph@mongodb.com garaudy.etienne@mongodb.com did we agree that this works as designed?

Comment by Randolph Tan [ 29/Jun/22 ]

It can happen from the moment we added the migration coordinator recovery, which needs to talk to the primary and wait for w: majority; that was in v4.4. So any scenario that leads to performing recovery during a query while the recipient doesn't have a primary can experience this issue. I think the test I attached won't be able to demonstrate this there, as the migration behaves differently (for example, it doesn't clear metadata on error).

Comment by Kaloian Manassiev [ 29/Jun/22 ]

I meant more whether it can happen in older (pre-5.0) versions, and what in those versions was preventing it from happening.

Comment by Randolph Tan [ 27/Jun/22 ]

Attached test_v6.1.js because the behavior is a little bit different on latest head. The original test.js will hang because the migration gets stuck waiting for the recipient to release the critical section, which involves doing a w: majority write. The new test steps down the primary instead of trying to join the migration. New queries will also get stuck waiting for the recipient to release the critical section when attempting to recover the shard version.
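
For reference, stepping down a shard's primary in a jstest is typically done like this (a sketch, not the attached test_v6.1.js; `st` is assumed to be a ShardingTest as in the outline above):

    // Force the recipient shard's primary to step down without waiting for a
    // caught-up secondary; the 60-second stepdown period is illustrative.
    const recipientPrimary = st.rs1.getPrimary();
    assert.commandWorked(
        recipientPrimary.adminCommand({replSetStepDown: 60, force: true}));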

Comment by Randolph Tan [ 27/Jun/22 ]

kaloian.manassiev@mongodb.com, I think it can happen in master too. The test.js I posted behaved differently, so I will update the ticket once I have more conclusive findings for master. v5.0.3 was the version where I was observing this behavior, not necessarily the version where it started failing.

Comment by Kaloian Manassiev [ 27/Jun/22 ]

randolph@mongodb.com, this says 5.0.3. Do you know what in 5.0.3 specifically caused this problem? This could not have possibly worked correctly before PM-1645, because at that point we would have treated the collection as UNSHARDED, but if an up-to-date router hit it, it would still have triggered recovery.

Comment by Randolph Tan [ 24/Jun/22 ]

Added test.js to demonstrate the issue.
