[SERVER-67530] Loss of shard RS primary can lead to a loss of read availability for a collection after failed migration | Created: 24/Jun/22 | Updated: 06/Dec/22 | Resolved: 14/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.3, 6.1.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | [DO NOT USE] Backlog - Sharding EMEA |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | sharding-product-sync |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | test.js, test_v6.1.js |
| Issue Links: | |
| Assigned Teams: | Sharding EMEA |
| Operating System: | ALL |
| Sprint: | Sharding 2022-06-27 |
| Participants: | |
| Description |
|
If a shard's replica set loses its primary node and is unable to elect a new one at the inopportune moment when a failed migration needs to be recovered, the collection will not be available on that shard. This is because the recipient clears its filtering metadata when an error occurs. Completing the migration will also not succeed, since the donor can't reach the primary of the recipient shard, which means that the migration coordinator document will not be deleted. As a result, when new requests come in for the collection, shard recovery gets triggered because the filtering metadata was cleared earlier. The recovery process discovers that the migration coordinator document is still around and tries to complete the migration again, but times out waiting for the ReplicaSetMonitor (RSM) to find a primary. |
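For illustration, here is a minimal jstest-style sketch of the sequence described above. This is not the attached test.js; the failpoint name and the exact way the migration is made to fail are assumptions, only the overall shape of the scenario is taken from the description, and the real test has to interleave the failure and the stepdown more carefully than this.

{code:javascript}
// Sketch only: "failed migration, then the recipient shard loses its primary".
// Assumes the jstest helpers ShardingTest/configureFailPoint; the failpoint
// name "failMigrationCommit" is a placeholder for whatever makes the
// migration fail after the recipient has set up its state.
load("jstests/libs/fail_point_util.js");

const st = new ShardingTest({shards: 2, rs: {nodes: 2}});
const ns = "test.coll";

assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
st.ensurePrimaryShard("test", st.shard0.shardName);
assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {_id: 1}}));

// Make the migration fail, leaving a migration coordinator document behind
// and causing the recipient to clear its filtering metadata.
const fp = configureFailPoint(st.rs0.getPrimary(), "failMigrationCommit");
assert.commandFailed(
    st.s.adminCommand({moveChunk: ns, find: {_id: 0}, to: st.shard1.shardName}));
fp.off();

// Take away the recipient's primary and prevent a new election.
st.rs1.getSecondaries().forEach(
    node => assert.commandWorked(node.adminCommand({replSetFreeze: 120})));
assert.commandWorked(
    st.rs1.getPrimary().adminCommand({replSetStepDown: 120, force: true}));

// A new query triggers shard version recovery, which finds the leftover
// coordinator document, tries to complete the migration, and times out
// waiting for the recipient's primary.
assert.commandFailedWithCode(
    st.s.getDB("test").runCommand({find: "coll", maxTimeMS: 10 * 1000}),
    ErrorCodes.MaxTimeMSExpired);

// Let the recipient elect a primary again so teardown can proceed cleanly.
st.rs1.nodes.forEach(
    node => assert.commandWorked(node.adminCommand({replSetFreeze: 0})));
st.stop();
{code}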
| Comments |
| Comment by Garaudy Etienne [ 14/Jul/22 ] |
|
The customer can use read concern level "available" to do reads. They won't be able to do w:1 writes without having majority write availability, because the recipient cannot release the critical section. As such, closing as "Won't Fix". |
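For reference, a sketch of the read workaround in shell syntax (the test.coll namespace is illustrative). Read concern "available" skips the shard versioning/filtering checks, so such reads do not wait on shard version recovery, at the cost of possibly returning orphaned documents; since the shard has no primary, a secondary-capable read preference is needed as well.

{code:javascript}
// Connected to mongos; "test.coll" is illustrative.
const coll = db.getSiblingDB("test").coll;

// Read concern "available" does not wait for shard version recovery and may
// return orphaned documents; secondaryPreferred lets the read target a
// secondary while the shard has no primary.
const docs = coll.find({})
                 .readConcern("available")
                 .readPref("secondaryPreferred")
                 .toArray();

// Writes are a different story: even {w: 1} writes need the shard's primary,
// and without majority write availability the recipient cannot release the
// critical section.
{code}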
| Comment by Randolph Tan [ 14/Jul/22 ] |
|
I don't think this is "worked as designed" in the strict sense; I'll leave it to garaudy.etienne@mongodb.com to evaluate whether we are willing to accept this edge case and close it as "won't fix". |
| Comment by Cris Insignares Cuello [ 14/Jul/22 ] |
|
randolph@mongodb.com, garaudy.etienne@mongodb.com, did we agree that this works as designed? |
| Comment by Randolph Tan [ 29/Jun/22 ] |
|
It can happen from the moment we added the migration coordinator recovery, which needs to talk to the primary and wait for w: majority; that was in v4.4. So any scenario that leads to performing recovery during a query while the recipient doesn't have a primary can experience this issue. I think the test I attached won't be able to demonstrate this on older versions, as migration behaves differently there (for example, it doesn't clear metadata on error). |
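To make that dependency concrete, a small illustration (not the ticket's repro, and the namespace is made up): the coordinator recovery has to run on a primary and persist its decision with w: "majority", so it is subject to the same constraints as any other majority write.

{code:javascript}
// Illustration only: if the shard has lost its write majority, a w:"majority"
// write waits until wtimeout and then reports a write concern error; if it has
// no primary at all, the write cannot even be accepted. The migration
// coordinator recovery depends on exactly this kind of write, so it cannot
// make progress either.
const res = db.getSiblingDB("test").coll.insert(
    {illustration: 1},
    {writeConcern: {w: "majority", wtimeout: 5000}});
assert.writeError(res);
{code}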
| Comment by Kaloian Manassiev [ 29/Jun/22 ] |
|
I meant more whether it can happen in older (pre-5.0) versions, and what in those older versions was preventing it from happening? |
| Comment by Randolph Tan [ 27/Jun/22 ] |
|
Attached test_v6.1.js because the behavior is a little different on latest head. The original test.js will hang because the migration gets stuck waiting for the recipient to release the critical section, which involves doing a w: majority write. The new test steps down the primary instead of trying to join the migration. New queries will also get stuck waiting for the recipient to release the critical section when attempting to recover the shard version. |
| Comment by Randolph Tan [ 27/Jun/22 ] |
|
kaloian.manassiev@mongodb.com, I think it can happen in master too. The test.js I posted behaved differently, so I will update the ticket once I have more conclusive findings for master. v5.0.3 was the version where I was observing this behavior, not necessarily the version where it started failing. |
| Comment by Kaloian Manassiev [ 27/Jun/22 ] |
|
randolph@mongodb.com, this says 5.0.3. Do you know what in 5.0.3 specifically caused this problem? This could not have possibly worked correctly before PM-1645, because at that point we would have treated the collection as UNSHARDED, but if an up-to-date router hit it, it would still have triggered recovery. |
| Comment by Randolph Tan [ 24/Jun/22 ] |
|
Added test.js to demonstrate the issue. |