[SERVER-45119] CollectionShardingState::getCurrentShardVersionIfKnown returns collection version instead of shard version Created: 12/Dec/19 Updated: 29/Oct/23 Resolved: 06/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.2.5, 4.0.17 |
| Fix Version/s: | 4.2.6, 4.0.18 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Kaloian Manassiev |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | KP42, regression, sharding-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2019-12-30, Sharding 2020-04-20 |
| Case: | (copied to CRM) |
| Description |
|
Issue Status as of Jul 31, 2020

ISSUE DESCRIPTION AND IMPACT
A bug in shard version checking causes a race condition between parallel chunk migrations and auto-split activity. If the race condition occurs, an affected shard becomes unable to update its sharding metadata, and operations that require data from that shard will fail. While it is possible for the issue to clear on its own, it is likely to persist until action is taken.

DIAGNOSIS AND AFFECTED VERSIONS
Sharded clusters with 2 or more shards running MongoDB versions <=4.2.5 and version 4.0.17 are impacted. However, the bug is much more likely to be triggered on 4.0.17 than on other versions. If the bug is triggered, client operations will begin failing with "version mismatch detected" (StaleConfig) errors, and the corresponding mongos logs will include "requested shard version differs from config shard version" error messages.

REMEDIATION AND WORKAROUNDS
If running MongoDB version 4.0.17, downgrade to 4.0.16 or upgrade to 4.0.18 when it becomes available. If running MongoDB version 4.2.5, upgrade to version 4.2.6 when it becomes available. In the event a version change is not possible, this issue can be partially mitigated by:
If you re-enable the balancer, the bug can be triggered again.

Note: If the sharded cluster is running with authentication enabled, you need to grant the internal action on the cluster resource in order to run the _flushRoutingTableCacheUpdates command. You can create a new role with the internal privilege on the cluster resource, and then grant this role to the admin user as below. Replace ADMIN_USER with the username for the admin.
FIX VERSIONS
4.2.6 and 4.0.18

Original description
This should call getShardVersion() instead of getCollVersion(). Its only usage is here. Fortunately, the check here is still valid even though we were returning the collection version. Basically, if a shard knows about collection version X and shard version Y, then it's not possible for the actual shard version to be between X and Y, because otherwise the shard would know about it. |
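To make the distinction in the original description concrete, here is a minimal Python sketch of the two quantities. It is not the server's C++ code: the `ChunkVersion`, `Chunk`, `collection_version` and `shard_version` names are hypothetical, the epoch component of a real chunk version is omitted, and a shard that owns no chunks is assumed to report (0, 0). The collection version is the highest chunk version across all chunks of the collection, while the shard version is the highest chunk version among only the chunks that a given shard owns.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ChunkVersion:
    # Simplified stand-in: compared as (major, minor); the epoch is omitted.
    major: int
    minor: int


@dataclass(frozen=True)
class Chunk:
    owner_shard: str
    version: ChunkVersion


def collection_version(chunks):
    """Highest chunk version across every chunk in the collection."""
    return max(c.version for c in chunks)


def shard_version(chunks, shard):
    """Highest chunk version among the chunks owned by `shard`.

    Assumption: a shard owning no chunks reports (0, 0).
    """
    owned = [c.version for c in chunks if c.owner_shard == shard]
    return max(owned, default=ChunkVersion(0, 0))


# Illustrative routing table, invented for this sketch.
chunks = [
    Chunk("shard0", ChunkVersion(4, 1)),
    Chunk("shard1", ChunkVersion(5, 1)),
    Chunk("shard2", ChunkVersion(6, 0)),
]

print(collection_version(chunks))       # what getCollVersion() corresponds to
print(shard_version(chunks, "shard0"))  # what getShardVersion() corresponds to
```

In this sketch, shard0's shard version is lower than the collection version, which is the gap that returning the collection version from getCurrentShardVersionIfKnown would hide.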
| Comments |
| Comment by Linda Qin [ 03/Jul/20 ] |
|
Updating the user summary after the discussion with kelsey.schubert. There are two updates:
Here are some scenarios in which a sharded cluster with 2 or 3 shards might also be affected by this:
Repro: Scenario 1
Scenario 2
| Comment by Githook User [ 06/Apr/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}
Message: |
| Comment by Githook User [ 06/Apr/20 ] |
|
Author: {'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}
Message: |
| Comment by Kaloian Manassiev [ 05/Apr/20 ] |
|
Attaching the SERVER-45119 - Repro.js |
| Comment by Kaloian Manassiev [ 25/Mar/20 ] |
|
I updated the "Affects Version/s" field to indicate that it's not present in 4.4 and later, but I think we should fix this and backport it all the way to the versions in which it is present. While I understand that there is some intricate logic through which it doesn't cause infinite refreshes, it is too brittle of an assumption to make and we can easily break it with some backport we do. matthew.saltz, would you be able to do the backports while we still have our attention on stabilisation? I don't want it to go on the backlog and drag it forever. |