[SERVER-29961] Close change notification cursors when a chunk migrates to a new shard Created: 03/Jul/17 Updated: 30/Oct/23 Resolved: 13/Sep/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Siyuan Zhou |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Sprint: | Repl 2017-08-21, Repl 2017-09-11, Repl 2017-10-02 | ||||||||
| Participants: | |||||||||
| Description |
|
When a new shard is added to the cluster, any existing change notification cursors won't have corresponding cursors on that shard, so if chunks migrate to that shard future updates to that data could be missed by existing change notification cursors. To protect against this, any time a shard donates a chunk to another shard, it will first check its copy of the routing table to determine if the destination shard has any chunks for that collection. If not, then it will close all of its own change notification cursors on that collection. This will cause the mongos to close its corresponding cluster change notification cursor, which will in turn cause the driver to retry the change notification request, at which point the new change notification cursor will target all existing shards and include the newly added shard |
| Comments |
| Comment by Ramon Fernandez Marina [ 13/Sep/17 ] | ||||||||||||
|
Author: {'username': u'visualzhou', 'name': u'Siyuan Zhou', 'email': u'siyuan.zhou@mongodb.com'}Message: | ||||||||||||
| Comment by Githook User [ 24/Aug/17 ] | ||||||||||||
|
Author: {'username': 'visualzhou', 'email': 'siyuan.zhou@mongodb.com', 'name': 'Siyuan Zhou'}Message: | ||||||||||||
| Comment by Siyuan Zhou [ 07/Aug/17 ] | ||||||||||||
|
Making the behavior work on secondaries is challenging because only the primary of the donor coordinates chuck migrations. Another subtle problem is when to close the cursor. If we don't close the cursor until the donor updates the routing table on the config server, there might be new updates on the recipient that will be missed when resuming with the resume token returned by the donor. As discussed with spencer and schwerin, we believe this issue will be fixed by two-phase commit since chunk migration will be a transaction involving the donor, the recipient and the config server. We will be able to close the cursor and return the resume token of the commit time. However, without two-phase commit in 3.6, we need to log a no-op message after entering the critical section and before updating the config server. Thus the cluster time of this no-op will be after the cluster time of adding a new shard and before the commit time of chunk migration on the config server due to causal consistency. The no-op message is only logged when the recipient doesn't have any data of the collection before. The proposed no-op message has the following format:
For back-compatibility, we cannot use the "ns" field because it's expected to be empty for no-ops. | ||||||||||||
| Comment by Spencer Brody (Inactive) [ 24/Jul/17 ] | ||||||||||||
|
The problem outlined in my last comment can be fixed by double-checking the set of all shards (by querying the config servers) after establishing cursors on all previously know-about shards, and retrying if any shards have been added. | ||||||||||||
| Comment by Spencer Brody (Inactive) [ 03/Jul/17 ] | ||||||||||||
|
On second thought this approach won't work. There is a race where when establishing a new change notification cursor on a sharded collection a mongos could first get the list of all shards. After getting that list, a shard could be added and a chunk migrated to that shard. If it’s only after that point that the mongos goes on to establish cursors on all shards you can wind up in a state where there is a cursor on all shards but the new one, and the new shard already has a chunk so the system will never notice that it is missing notifications from that shard. |