[SERVER-31275] Causal Consistency with secondary reads is broken by chunk migration commit Created: 26/Sep/17 Updated: 30/Oct/23 Resolved: 08/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Sharding 2017-10-23, Sharding 2017-11-13 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||
| Description |
|
Detailed description to follow. Donor shard secondaries do not block access to the donated chunk while the primary waits for the config server to acknowledge the chunk migration commit message. As a result, they may serve reads of stale data after the recipient shard has accepted the donated chunk and taken writes to it. In the jargon of the sharding team: because donor shard secondaries do not observe the "chunk migration commit critical section", they may serve reads of documents owned by the recipient shard during the time between the primary sending the chunk commit message to the config server and the primary subsequently refreshing its routing table. |
| Comments |
| Comment by Dianna Hohensee (Inactive) [ 08/Nov/17 ] |
|
This also fixed the config.cache.collection OpObserver hooks such that they actually ONLY run on secondaries. Originally |
| Comment by Githook User [ 08/Nov/17 ] |
|
Author: {'name': 'Dianna Hohensee', 'username': 'DiannaHohensee', 'email': 'dianna.hohensee@10gen.com'} Message: |
| Comment by Dianna Hohensee (Inactive) [ 02/Nov/17 ] |
|
Through inspection, found a bug in safe secondary reads: secondaries receive new metadata and invalidate the CatalogCache, but not the active CollectionMetadata, so a secondary will continue to serve requests against stale metadata without reloading. This will be fixed in this patch, as the new functionality swallows up the bug. |
| Comment by Dianna Hohensee (Inactive) [ 18/Oct/17 ] |
|
Final conclusion after discussion: on entering the critical section, the primary sends a flag to the secondary (a local write probably suffices). The secondary invalidates its routing table on receipt. When the secondary next refreshes, it must contact the primary to do so, at which point the primary blocks the forceRoutingTableRefresh command on the critical section. |
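The flow agreed on above can be sketched as a toy model. This is only an illustration under stated assumptions, not the server implementation: the `Primary` and `Secondary` classes, their method names, and the integer routing version are all hypothetical. Only `forceRoutingTableRefresh` and the critical section come from the ticket. The replicated flag write is modelled as a direct method call.

```python
import threading


class Primary:
    """Hypothetical stand-in for a donor shard primary."""

    def __init__(self):
        self.routing_version = 1
        self.secondaries = []
        # Set while NO critical section is active; cleared while one is.
        self._critical_section_free = threading.Event()
        self._critical_section_free.set()

    def enter_critical_section(self):
        self._critical_section_free.clear()
        # Stands in for the replicated local write that carries the flag.
        for secondary in self.secondaries:
            secondary.on_critical_section_flag()

    def commit_and_exit_critical_section(self):
        # The commit changes chunk ownership, so the routing version advances.
        self.routing_version += 1
        self._critical_section_free.set()

    def force_routing_table_refresh(self, timeout=None):
        # Blocks for as long as the critical section is held.
        if not self._critical_section_free.wait(timeout):
            raise TimeoutError("primary is still in the critical section")
        return self.routing_version


class Secondary:
    """Hypothetical stand-in for a donor shard secondary."""

    def __init__(self, primary):
        self.primary = primary
        self.cached_version = primary.routing_version
        self.needs_refresh = False
        primary.secondaries.append(self)

    def on_critical_section_flag(self):
        # Invalidate the routing table as soon as the flag arrives.
        self.needs_refresh = True

    def read(self, timeout=None):
        # An invalidated secondary must refresh through the primary first.
        if self.needs_refresh:
            self.cached_version = self.primary.force_routing_table_refresh(timeout)
            self.needs_refresh = False
        return self.cached_version


def demo():
    primary = Primary()
    secondary = Secondary(primary)
    before = secondary.read()
    primary.enter_critical_section()
    try:
        secondary.read(timeout=0.01)
        blocked = False
    except TimeoutError:
        blocked = True  # the read waited instead of using stale routing info
    primary.commit_and_exit_critical_section()
    after = secondary.read()
    return before, blocked, after
```

In this model a read that arrives on the secondary during the critical section cannot proceed on the old routing version: it either waits for the primary or times out, and after the commit it sees the new version.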
| Comment by Dianna Hohensee (Inactive) [ 03/Oct/17 ] |
|
Option 1: extend critical section to secondaries, only in the case of causal consistency's no-op write to the shard primary from a secondary – i.e., causal consistency is the only process that waits on the critical section on secondaries. |
| Comment by Dianna Hohensee (Inactive) [ 27/Sep/17 ] |
|
The shard primary sets and unsets a minOpTimeUpdaters value before and after the migration critical section. afterClusterTime could ensure that the secondary has everything the primary has, and then we could check the minOpTimeUpdaters value to detect migrations. The time of the write will be covered by the time resulting from the config server write, so anything with that cluster time would pull in the shard change to a secondary. This is effectively a critical section flag for the secondaries, but it is annoyingly ambiguous about which collection is in the critical section on the primary. Should keep in mind that more than one migration will be allowed to run on a shard at once in the future. |
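A minimal sketch of this idea, assuming minOpTimeUpdaters behaves like a counter that is raised when a migration enters its critical section and lowered when it leaves. The `OplogEntry` type, the integer cluster times, and both helper functions are hypothetical and exist only for illustration. The sketch also shows the ambiguity noted above: the counter says that some migration is in its critical section, not which collection it concerns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OplogEntry:
    cluster_time: int    # simplified stand-in for a cluster time
    updaters_delta: int  # +1 on critical section entry, -1 on exit, 0 otherwise


def updaters_at(oplog: List[OplogEntry], after_cluster_time: int) -> int:
    """Counter value seen by a secondary that has applied every entry
    up to and including after_cluster_time."""
    return sum(
        entry.updaters_delta
        for entry in oplog
        if entry.cluster_time <= after_cluster_time
    )


def migration_in_critical_section(oplog: List[OplogEntry], after_cluster_time: int) -> bool:
    # A non-zero counter means at least one migration is in its critical
    # section, but it does not identify the collection being migrated.
    return updaters_at(oplog, after_cluster_time) > 0


oplog = [
    OplogEntry(cluster_time=1, updaters_delta=0),   # ordinary write
    OplogEntry(cluster_time=2, updaters_delta=+1),  # critical section entered
    OplogEntry(cluster_time=3, updaters_delta=0),   # ordinary write
    OplogEntry(cluster_time=4, updaters_delta=-1),  # critical section exited
]
```

With this log, a read carrying an afterClusterTime of 2 or 3 would observe the counter raised and would have to wait, while a read at 1 or 4 would not. A counter also extends to several concurrent migrations, although it still cannot tell them apart.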