[SERVER-27941] Remove the shardVersion check when entering the critical section. Created: 07/Feb/17 Updated: 30/May/18 Resolved: 24/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.15, 3.5.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v3.4
|
||||||||
| Sprint: | Sharding 2017-03-06 | ||||||||
| Participants: | |||||||||
| Case: | (copied to CRM) | ||||||||
| Linked BF Score: | 0 | ||||||||
| Description |
|
We check here that the shardVersion we began the migration with is still the shardVersion. This seems like an unnecessary check now, since shards no longer hold distributed locks. We're running into a race condition when a shard is the recipient of a chunk and then the donor of a chunk in quick succession. This is the scenario:
|
| Comments |
| Comment by Githook User [ 13/Apr/18 ] |
|
Author: {'email': 'dianna.hohensee@10gen.com', 'name': 'Dianna Hohensee', 'username': 'DiannaHohensee'}Message: (cherry picked from commit ce23378926659bc50604032782485c2f962c37ac) |
| Comment by Misha Tyulenev [ 16/Mar/18 ] |
|
To resolve BF-5136 |
| Comment by Githook User [ 24/Feb/17 ] |
|
Author: {u'username': u'DiannaHohensee', u'name': u'Dianna Hohensee', u'email': u'dianna.hohensee@10gen.com'}Message: |
| Comment by Andy Schwerin [ 09/Feb/17 ] |
|
Thanks. That context is useful. |
| Comment by Dianna Hohensee (Inactive) [ 09/Feb/17 ] |
|
It's unclear what the original intent was. It goes back to this commit (in d_migrate.cpp, "assert( myVersion > shardingState.getVersion( ns ) );") My guess is that it was a check to see if the distributed lock had been overtaken since the migration began, and then whether another migration had happened somewhere else by checking the latest collection version. Since we used to have shards create and commit the new chunk versions, being the only ongoing migration was essential for success. Now it doesn't matter whether the collection version changes. We allow parallel migrations, and the config server serializes the creation and commit of new chunk versions. This check can actually cause parallel running migrations to be aborted unnecessarily, if the chunk metadata has cause to refresh between the migration start and entering the critical section, and another migration elsewhere has committed a new chunk (and thus collectionVersion gets bumped up). |
| Comment by Andy Schwerin [ 08/Feb/17 ] |
|
I could use an explanation for why this check used to be necessary and why it's definitely not necessary anymore. |