[SERVER-27941] Remove the shardVersion check when entering the critical section. Created: 07/Feb/17  Updated: 30/May/18  Resolved: 24/Feb/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.4.15, 3.5.4

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Sharding 2017-03-06
Participants:
Case:
Linked BF Score: 0

 Description   

We check here that the shardVersion we began the migration with is still the shardVersion. This seems like an unnecessary check now, since shards no longer hold distributed locks.

We're running into a race condition when a shard is the recipient of a chunk and then the donor of a chunk in quick succession. This is the scenario:

  • There a migration from shard Y to shard X.
  • We release destination shard X before the donor shard Y commits the migration to the config server.
  • Shard X starts acting as a donor shard in a new migration, still before shard Y commits the previous migration.
  • Donor shard X reloads the chunk metadata at the start of the new migration, setting the MigrationSourceManager::_collectionMetadata with that newly loaded chunk metadata.
  • Shard Y finally finishes the commit.
  • Shard X attempts to enter the critical section, checks MigrationSourceManager::_collectionMetadata's shardVersion still matches the shard's current chunk metadata, sees that they're different and fails.


 Comments   
Comment by Githook User [ 13/Apr/18 ]

Author:

{'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna Hohensee', 'username': 'DiannaHohensee'}

Message: SERVER-27941 change the shardVersion check to an epoch check when entering the critical section.

(cherry picked from commit ce23378926659bc50604032782485c2f962c37ac)
Branch: v3.4
https://github.com/mongodb/mongo/commit/f0299d35a0e91246fa95005c78319bcabd43494d

Comment by Misha Tyulenev [ 16/Mar/18 ]

To resolve BF-5136

Comment by Githook User [ 24/Feb/17 ]

Author:

{u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}

Message: SERVER-27941 change the shardVersion check to an epoch check when entering the critical section.
Branch: master
https://github.com/mongodb/mongo/commit/ce23378926659bc50604032782485c2f962c37ac

Comment by Andy Schwerin [ 09/Feb/17 ]

Thanks. That context is useful.

Comment by Dianna Hohensee (Inactive) [ 09/Feb/17 ]

It's unclear what the original intent was. It goes back to this commit (in d_migrate.cpp, "assert( myVersion > shardingState.getVersion( ns ) );")

My guess is that it was a check to see if the distributed lock had been overtaken since the migration began, and then whether another migration had happened somewhere else by checking the latest collection version. Since we used to have shards create and commit the new chunk versions, being the only ongoing migration was essential for success.

Now it doesn't matter whether the collection version changes. We allow parallel migrations, and the config server serializes the creation and commit of new chunk versions. This check can actually cause parallel running migrations to be aborted unnecessarily, if the chunk metadata has cause to refresh between the migration start and entering the critical section, and another migration elsewhere has committed a new chunk (and thus collectionVersion gets bumped up).

Comment by Andy Schwerin [ 08/Feb/17 ]

I could use an explanation for why this check used to be necessary and why it's definitely not necessary anymore.

Generated at Thu Feb 08 04:16:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.