[SERVER-20290] Recipient shard for migration can continue on retrieving data even after donor shard aborts Created: 04/Sep/15  Updated: 17/Nov/16  Resolved: 03/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.1.7
Fix Version/s: 3.2.3, 3.3.2

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Dianna Hohensee (Inactive)
Resolution: Done Votes: 0
Labels: code-and-test, csrsupgrade
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-16540 Potential race in migration can have ... Closed
Related
related to SERVER-22498 Fix migration session id multiversion... Closed
is related to SERVER-22351 Write tests for migration session ID Closed
is related to SERVER-22459 Write multiversion test for migration... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Sharding F (01/29/16), Sharding 10 (02/19/16)
Participants:

 Description   

In some cases the donor returns early without telling the recipient to abort the migration. Some of these cases are fine because they are the result of the recipient shard itself aborting. Worse, the _migrateClone and _transferMods commands don't include any parameter identifying which migration session they are requesting data for, so it seems possible for these commands to pull data intended for a different migration session, for example if the donor shard aborts without informing the recipient and then starts donating a chunk to another shard.

A donor restart would most likely not exhibit this issue, as the recipient shard uses the same connection to talk to the donor for the entire migration.

One example of the donor shard aborting is through the killOp interruption points.
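The race above can be modeled with a small Python sketch (the class and command names are illustrative, not the actual server implementation, which is C++): because the _migrateClone request carries no session identifier, the donor simply serves whatever migration is currently active, so a recipient that keeps polling after an unnoticed abort receives another migration's documents.

```python
# Toy model of the race: the donor's _migrateClone handler has no way to
# tell which migration session a request belongs to, so it serves the
# currently active migration's data to any caller.

class Donor:
    def __init__(self):
        self.active_session = None  # (session description, docs to clone)

    def start_migration(self, session_name, docs):
        self.active_session = (session_name, docs)

    def abort_migration(self):
        # e.g. hit via a killOp interruption point; the recipient is NOT told.
        self.active_session = None

    def handle_migrate_clone(self):
        # The request carries no session identifier, so the donor just
        # serves whatever migration happens to be active right now.
        if self.active_session is None:
            return []
        return self.active_session[1]


donor = Donor()
donor.start_migration("shard0 -> shard1, chunk A", ["a1", "a2"])
donor.abort_migration()  # the original recipient never hears about this
donor.start_migration("shard0 -> shard2, chunk B", ["b1", "b2"])

# The original recipient keeps cloning and silently gets chunk B's documents.
stale_pull = donor.handle_migrate_clone()
print(stale_pull)
```

The fix discussed in the comments below is to make every migration command carry an identifier that the donor can check against its current migration.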



 Comments   
Comment by Githook User [ 03/Feb/16 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: SERVER-20290 Fixing race condition in donor_shard_abort_and_start_new_migration.js between aborting first migration and starting second migration.
Branch: v3.2
https://github.com/mongodb/mongo/commit/ec81d5946837e6ad0c3818837f88a1f3f056248b

Comment by Githook User [ 03/Feb/16 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: Fixing format issue in SERVER-20290
Branch: v3.2
https://github.com/mongodb/mongo/commit/93934f75b662f51b86af424a51fd691f2992977a

Comment by Githook User [ 03/Feb/16 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: Fixing format issue in SERVER-20290
Branch: master
https://github.com/mongodb/mongo/commit/1335e35ce45539192475dddb1c82557f5d36d028

Comment by Githook User [ 03/Feb/16 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: SERVER-20290 Introduce migration session id, with test
Branch: v3.2
https://github.com/mongodb/mongo/commit/c91b1ea56a2619a123876970229556013cea5d9a

Comment by Githook User [ 02/Feb/16 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: SERVER-20290 Introduce migration session id
Branch: master
https://github.com/mongodb/mongo/commit/d0ae5688ea3083d2916c2213a262ed0ec2cf6b4f

Comment by Kaloian Manassiev [ 25/Jan/16 ]

I propose that we generate a 'migration id' of OID type and use it to identify individual migration instances. The migration id will be generated by the moveChunk call (which is what kicks off the migration machinery) and will be assigned to the migration source and destination managers. All migration sequence calls will have to pass it around, and it will be checked against the current migration.

For backwards compatibility, if an incoming request is missing the migration id, no checking will be performed, but otherwise the migration ids must match. Once all shard nodes are upgraded, all participants will be checking the migration id.
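A minimal Python sketch of this proposal (names are illustrative; the actual server code is C++): the id is generated when the migration is kicked off, threaded through each migration command, and checked on the donor side, with the check skipped when the request omits the id so that old-version peers keep working.

```python
# Sketch of the proposed migration session id check, donor side.
# A request with a mismatched id is rejected; a request with no id
# (an old-version shard) is allowed for backwards compatibility.
import uuid


class MigrationSourceManager:
    def __init__(self, session_id, docs):
        self.session_id = session_id  # generated by moveChunk
        self.docs = docs

    def handle_migrate_clone(self, request_session_id):
        # Old-version recipients send no id; skip the check for compatibility.
        if request_session_id is not None and request_session_id != self.session_id:
            raise RuntimeError("request is for a different migration session")
        return self.docs


first_migration = uuid.uuid4()   # aborted earlier, recipient not informed
second_migration = uuid.uuid4()  # currently active on the donor

source = MigrationSourceManager(second_migration, ["b1", "b2"])

assert source.handle_migrate_clone(None) == ["b1", "b2"]              # legacy peer
assert source.handle_migrate_clone(second_migration) == ["b1", "b2"]  # matching id

try:
    source.handle_migrate_clone(first_migration)  # stale recipient: rejected
    rejected = False
except RuntimeError:
    rejected = True
```

Once all shard nodes are upgraded and always send the id, the lenient branch no longer fires and every request is checked.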

Comment by Randolph Tan [ 04/Dec/15 ]

I don't think the recipient shard ever restarts in the current implementation. In the example race given in the description, the recipient shard was simply resuming a migration session that had already been aborted by the donor shard.

Comment by Andy Schwerin [ 04/Dec/15 ]

If the recipient restarts, what will cause it to resume migrating the chunk from the donor? When must the recipient restart?

Generated at Thu Feb 08 03:53:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.