- Type: Task
- Resolution: Gone away
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Labels: None
Sharding
During a migration, we persist a critical section counter, which is replicated to secondaries so that they clear their filtering metadata and their next refresh sees the result of the migration. The idea is that a secondary refreshes from the primary by calling forceRoutingTableRefresh on the primary and waiting for the result to replicate; forceRoutingTableRefresh itself waits for the critical section before refreshing.
However, when we persist that critical section counter, we don't use majority write concern, and we never wait for majority before we commit the migration on the config server.
This means that the following sequence can lead to stale data being read:
1. Start a migration.
2. Write the critical section counter. Suppose it doesn't get replicated at all.
3. Commit the migration on the config server.
4. Failover.
5. A new primary is elected which does not know that a migration has occurred. It could continue serving requests from a router that is just as stale as the secondary was.
We should verify this with a jstest and then fix it by persisting the critical section counter with majority write concern.
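
To make the sequence above concrete, here is a minimal toy simulation in Python. It is not server code and not the proposed jstest: `Node`, `ToyShard`, `write_counter`, `fail_over` and `run_scenario` are hypothetical names invented for illustration, and the model assumes that an election picks the most up-to-date surviving node and that a non-majority write reaches no secondary before the failover (the worst case from step 2).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Toy stand-in for the persisted critical section counter on this node.
    critical_section_counter: int = 0


@dataclass
class ToyShard:
    nodes: list
    primary: int = 0

    def write_counter(self, majority: bool) -> None:
        """Bump the counter on the primary.

        With majority=True, copy it to secondaries until a majority of the
        set (primary included) holds the write. With majority=False, nothing
        is replicated, which models the worst case described in step 2.
        """
        p = self.nodes[self.primary]
        p.critical_section_counter += 1
        if not majority:
            return
        needed = len(self.nodes) // 2 + 1
        acked = 1  # the primary itself
        for n in self.nodes:
            if acked >= needed:
                break
            if n is not p:
                n.critical_section_counter = p.critical_section_counter
                acked += 1

    def fail_over(self) -> Node:
        """Drop the current primary and elect the most up-to-date survivor."""
        self.nodes.pop(self.primary)
        new_primary = max(self.nodes, key=lambda n: n.critical_section_counter)
        self.primary = self.nodes.index(new_primary)
        return new_primary


def run_scenario(majority: bool) -> bool:
    """Return True if the new primary knows that a migration happened."""
    shard = ToyShard([Node("a"), Node("b"), Node("c")])  # step 1: migration starts
    shard.write_counter(majority)                        # step 2: persist the counter
    # step 3: the migration is committed on the config server (not modelled here)
    new_primary = shard.fail_over()                      # step 4: failover
    return new_primary.critical_section_counter > 0      # step 5: what does it know?


if __name__ == "__main__":
    print("without majority:", run_scenario(majority=False))
    print("with majority:   ", run_scenario(majority=True))
```

In this model the new primary only learns about the migration when the counter was written with majority, which is the behaviour the proposed fix relies on. A real jstest would still need to confirm that the actual server behaves this way.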