[SERVER-41934] Filtering metadata could be stale and serve queries if stepdown happens during migration Created: 26/Jun/19 Updated: 27/Oct/23 Resolved: 17/Aug/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Sharding |
| Participants: | |
| Description |
|
During migration, we persist a critical section counter which is replicated to secondaries to make them clear their filtering metadata, so that their next refresh will see the result of the migration. The idea is that when a secondary refreshes, it refreshes from the primary by calling forceRoutingTableRefresh on the primary and waiting for the result to replicate, and forceRoutingTableRefresh in turn waits for the critical section before refreshing. However, when we persist that critical section counter, we don't use majority write concern, and we never wait for majority before we commit the migration on the config server. This means that if a stepdown happens during the migration, a node that never saw the critical section counter write can step up and keep serving queries with stale filtering metadata. We should verify this with a jstest and then fix it by persisting the critical section counter with majority write concern. |
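As a rough illustration only: the real write is issued internally by the donor primary's migration code (in C++), not from the shell, but the proposed fix amounts to making that write majority-acknowledged before the migration is allowed to commit. The sketch below assumes a direct connection `donorPrimary` to the donor shard's primary, a namespace string `ns`, and a 60-second wtimeout; the `enterCriticalSectionCounter` field name is taken from the discussion in the comments.

```javascript
// Illustrative sketch only: the shell-level equivalent of persisting the
// critical section counter with majority write concern, expressed as an update
// against the shard's persisted routing-table cache entry
// (config.cache.collections). `donorPrimary` and `ns` are assumed to exist;
// the real write happens inside the server, not via a client command.
assert.commandWorked(donorPrimary.getDB("config").runCommand({
    update: "cache.collections",
    updates: [{q: {_id: ns}, u: {$inc: {enterCriticalSectionCounter: 1}}}],
    // The proposed change: require majority acknowledgement before the donor
    // proceeds to commit the migration on the config server.
    writeConcern: {w: "majority", wtimeout: 60000},
}));
```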
| Comments |
| Comment by Kaloian Manassiev [ 17/Aug/20 ] |
|
Gone away as a result of the changes for PM-1645, Milestone 1. |
| Comment by Matthew Saltz (Inactive) [ 21/Oct/19 ] |
|
I think there's another solution to this. We can move this write to before the critical section for writes and make it a majority write, and then add a separate portion of the critical section for blocking secondary reads. In other words, the current flow on the donor is roughly:
I think we can add a new "block secondary reads (BSR)" phase so it looks like:
We could also just have a separate concept from the critical section that gives secondary refreshes something to block on. Thanks to schwerin for pointing out that we probably don't need to do this write during the critical section, which I think is true. |
| Comment by Matthew Saltz (Inactive) [ 31/Jul/19 ] |
|
Just to follow up: I think majority-writing the critical section counter would be by far the simplest and easiest-to-understand fix. kaloian.manassiev, are you confident that this would be too expensive? Do you think it's necessary to go with the 'clear metadata for all collections' approach? I think we'd also have to think carefully about that approach, especially given the recent discussion around clearing the _receivingChunks list when we clear the active metadata. It could also cause a "refresh storm" after they are all cleared. Is that a concern? |
| Comment by Esha Maharishi (Inactive) [ 11/Jul/19 ] |
|
Ah, that's true :/ |
| Comment by Matthew Saltz (Inactive) [ 11/Jul/19 ] |
|
I think that would race with secondaries updating their routing tables from the primary, which blocks behind this. |
| Comment by Esha Maharishi (Inactive) [ 11/Jul/19 ] |
|
I wonder if we could write the enterCriticalSectionCounter flag just before the primary enters the critical section, the way we do for the minOpTimeRecovery document. That way, secondaries would enter the critical section as soon as they see the flag, but the primary would not enter it until the flag has been majority-replicated. We could also wait for majority just once, after writing both the minOpTime and the enterCriticalSectionCounter flag. |
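A minimal shell-level sketch of that ordering, assuming a `donorPrimary` connection and an `ns` string (and assuming, for illustration only, that the minOpTime recovery document lives at `admin.system.version` with `_id: "minOpTimeRecovery"`; the real writes are internal to the server): only the second write carries majority write concern, and because the majority commit point advances through the oplog in order, acknowledging the later opTime also covers the earlier recovery-document write.

```javascript
// 1. Bump the minOpTime recovery document with local write concern (document
//    location and field name are assumptions for illustration).
assert.commandWorked(donorPrimary.getDB("admin").runCommand({
    update: "system.version",
    updates: [{
        q: {_id: "minOpTimeRecovery"},
        u: {$inc: {minOpTimeUpdaters: 1}},
        upsert: true,
    }],
}));

// 2. Bump the enterCriticalSectionCounter flag and wait for majority once.
//    This single wait also guarantees the write in step 1 is
//    majority-committed, because it has an earlier opTime on the same primary.
assert.commandWorked(donorPrimary.getDB("config").runCommand({
    update: "cache.collections",
    updates: [{q: {_id: ns}, u: {$inc: {enterCriticalSectionCounter: 1}}}],
    writeConcern: {w: "majority", wtimeout: 60000},
}));
```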
| Comment by Matthew Saltz (Inactive) [ 11/Jul/19 ] |
|
I had talked to kaloian.manassiev shortly after this ticket was filed, and he suggested that making the critical section counter write wait for majority write concern could be too costly, since it happens inside the critical section. So I think he suggested that when recovering the sharding state, if we see that there were in-flight migrations, we should clear out the filtering metadata for all collections. Does this sound problematic? |
| Comment by Kaloian Manassiev [ 02/Jul/19 ] |
|
renctan pointed me to the mistake in my reasoning yesterday: having an outstanding minOpTime recovery increment after a step-up only instructs the shard to sync up its config server opTime, but doesn't cause a refresh of the collection sharding state. Because of this, a node becoming primary can still continue to accept reads and writes at a stale version. |
| Comment by Esha Maharishi (Inactive) [ 01/Jul/19 ] |
|
kaloian.manassiev I still think there is a bug: having a non-empty minOpTime on stepup simply makes a new primary get the latest configOpTime, so that later refreshes will use the latest configOpTime. It does not cause the new primary to synchronously force refreshes for any collections before accepting requests. |
| Comment by Randolph Tan [ 01/Jul/19 ] |
|
kaloian.manassiev The write to the recovery document happens before the update to the config collection that causes the secondary to refresh. So if the secondary has not seen that write when it becomes the primary, it can have stale collection metadata even if its config opTime is up to date, since nothing will prompt it to talk to the config server. |
| Comment by Kaloian Manassiev [ 01/Jul/19 ] |
|
matthew.saltz, I don't think this will actually happen. The donor does two more activities as part of committing the migration, and these are: Therefore, if after step (5) above a new primary is elected which never sees the write at step (2), but the migration is committed, it must at least see the increment of the minOpTime recovery document and will therefore sync with the config server and catch up to the correct shard version. Alternatively, if it happens to see the decrement from step (7), it must also have seen the write from step (2) and cleared its metadata in order to force a refresh, so it won't be accepting writes at the old version. I agree, though, that the algorithm is not well described or commented in the code, and it took me a while to read through it. Could we maybe formalize it a bit as part of the new writes you are adding for the Range Deleter project? |