[SERVER-20290] Recipient shard for migration can continue on retrieving data even after donor shard aborts Created: 04/Sep/15 Updated: 17/Nov/16 Resolved: 03/Feb/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.1.7 |
| Fix Version/s: | 3.2.3, 3.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-and-test, csrsupgrade |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Sharding F (01/29/16), Sharding 10 (02/19/16) |
| Participants: | |
| Description |
|
In some cases the donor does not tell the recipient to abort the migration when it returns early. Some of these cases are benign because the early return is itself a result of the recipient shard aborting. To make things worse, the _migrateClone and _transferMods commands don't include any parameter identifying which migration session they are requesting data for, so it seems possible for these commands to pull data intended for a different migration session, for example if the donor shard aborts without informing the recipient and then starts donating a chunk to another shard. A donor restart would most likely not exhibit this issue, because the recipient shard uses the same connection to talk to the donor for the entire migration. One way the donor shard can abort is through the killOp interruption points. |
| Comments |
| Comment by Githook User [ 03/Feb/16 ] |
|
Author: Dianna Hohensee (DiannaHohensee, dianna.hohensee@10gen.com) Message: |
| Comment by Githook User [ 03/Feb/16 ] |
|
Author: Dianna Hohensee (DiannaHohensee, dianna.hohensee@10gen.com) Message: Fixing format issue in |
| Comment by Githook User [ 03/Feb/16 ] |
|
Author: Dianna Hohensee (DiannaHohensee, dianna.hohensee@10gen.com) Message: Fixing format issue in |
| Comment by Githook User [ 03/Feb/16 ] |
|
Author: Dianna Hohensee (DiannaHohensee, dianna.hohensee@10gen.com) Message: |
| Comment by Githook User [ 02/Feb/16 ] |
|
Author: Dianna Hohensee (DiannaHohensee, dianna.hohensee@10gen.com) Message: |
| Comment by Kaloian Manassiev [ 25/Jan/16 ] |
|
I propose that we generate a 'migration id' of OID type and use it to identify individual migration instances. The migration id will be generated by the moveChunk call (which is what kicks off the migration machinery) and will be assigned to the migration source and destination managers. All migration sequence calls will have to pass it around, and it will be checked against the current migration. For backwards compatibility, if an incoming request is missing the migration id, no checking will be performed; otherwise the migration ids must match. Once all shard nodes are upgraded, all participants will be checking the migration id. |
| Comment by Randolph Tan [ 04/Dec/15 ] |
|
I don't think the recipient shard ever restarts in the current implementation. In the example race given in the description, the recipient shard was simply resuming the migration session that was already aborted by the donor shard. |
| Comment by Andy Schwerin [ 04/Dec/15 ] |
|
If the recipient restarts, what will cause it to resume migrating the chunk from the donor? When must the recipient restart? |